Introduction to Bioinformatics of next-generation sequencing
Sequence acquisition and processing; genome mapping and alignment manipulation
Ruslan Sadreyev
Director of Bioinformatics
Department of Molecular Biology, MGH
Department of Pathology, MGH, HMS
Next-Generation Sequencing Core and Bioinformatics group at Molecular Biology
5/18/17 NGS analysis 1
sadreyev@molbio.mgh.harvard.edu
Next-generation sequencing core
Website: nextgen.mgh.harvard.edu
Email: nextgen@research.mgh.harvard.edu
Bioinformatics team
Website: molbio.mgh.harvard.edu/department/bioinformatics
Email: bioinfo@molbio.mgh.harvard.edu
Bring your DNA/RNA!
Bring your data!
Founding members of MGH Bioinformatics Consortium
• Analytic and Translational Genetics Unit
• Biomedical Informatics Core
• Center for Integrated Diagnostics Bioinformatics Group
• Bioinformatics at the MGH Cancer Center and Department of Pathology
• Genomics and Technology Core
• ITN/PHS Information Systems/Immune Tolerance Network
• Bioinformatics group at the Department of Molecular Biology
• MGH next-generation sequencing core
Missions of Bioinformatics Consortium
Support and collaboration
Develop and support the informatics component of fundamental, translational, and clinical research projects
Consulting
Plan experimental design and execution of data generation and analysis; advise on best practices
Educational outreach
Education in general Bioinformatics concepts and methods; helping researchers think about their data in quantitatively rigorous terms
Training
Teach hands-on computational skills in statistics and Bioinformatics workflows
About this mini-course
You can
(a) get oriented in basic NGS Bioinformatics concepts, approaches, and tools
(b) start asking the right questions about your data
You cannot get
(a) very deep coverage of a specific area/method/application
(b) hands-on computational experience
Two seminars:
• Sequence acquisition and processing; genome mapping and alignment manipulation: Thurs May 18 at 2 pm
• Specific NGS applications and public datasets: Thurs May 25 at 2 pm
I used slides/images/material by …
BF Francis Ouellette, Istvan Albert, Mik Black & Christin Print, Thomas Keane, Illumina
A variety of existing NGS technologies
Metzker (2010) Nature Rev Genetics 11
Sequencing by synthesis (SBS)
Metzker (2010) Nature Rev Genetics 11
Major NGS applications: examples
• Whole Genome Shotgun Sequencing (WGS)
• Targeted/exome sequencing
• RNA-seq
• ChIP-seq
• Metagenomics (targeted region/whole genome sequencing)
• MANY more
Basic workflow
From: Thomas Keane
Randomly shear DNA + end repair + size select
Append sequencing adapters
Layout of library on sequencing slide or wells (e.g. cBot)
For each library fragment, determine the first N bases at one or both ends of the fragment
Image processing + base calling -> bases and quality (FASTQ)
Illumina MiSeq: desktop device
Our current output: 1 lane, 10-20 M reads per lane
• Fast
• Flexible (can do longer reads, up to 500 bp)
• ~10x fewer reads than HiSeq
• Cheaper per run (but not per read)
Applications:
• Amplicon sequencing
• QC before large-scale runs
• Bacterial genomes
• …
www.illumina.com
Base quality calibration
Phred score: a measure of p = Prob(erroneous base call): Q = -10 log10(p)
Q10 = 1 in 10 chance of incorrect base call
Q20 = 1 in 100 chance of incorrect base call
Q30 = 1 in 1000 chance of incorrect base call
Rule of thumb: data with scores below Q20 are not useful
Standard assessment: proportion of bases with score ≥ Q30
Highest Phred scores are typically around Q35-40
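The Phred relationship can be checked numerically. The sketch below is only an illustration of the formula Q = -10 log10(p); the function names are made up for this example and do not come from any sequencing toolkit.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Error probability p for a Phred score Q, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred score Q for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def fraction_q30(quals) -> float:
    """Proportion of bases with score >= Q30 (the standard assessment)."""
    return sum(q >= 30 for q in quals) / len(quals)

# Print the error probability behind a few common thresholds.
for q in (10, 20, 30, 40):
    print(f"Q{q}: p = {phred_to_error_prob(q):g}")
```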
FASTQ format for NGS sequences
http://en.wikipedia.org/wiki/FASTQ_format
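A FASTQ record is four lines: an @name header, the bases, a + separator, and one quality character per base. The sketch below reads such records in Python. It assumes the Phred+33 (Sanger) quality encoding and unwrapped sequence lines; older Illumina files used a different offset. The function name and the example read are hypothetical.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, optionally repeating the name
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumption: Phred+33 encoding, so ASCII code minus 33 is the score.
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# Hypothetical single-read example held in memory instead of a file.
example = StringIO("@read1\nGATTACA\n+\nIIII5+!\n")
records = list(read_fastq(example))
for name, seq, quals in records:
    print(name, seq, quals)
```

For a real file, the same generator can be fed an open file handle instead of the in-memory example.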
Quality control of raw sequences: FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Good example: per base sequence quality
Good example: per base sequence content
Bad example: per base sequence quality
Bad example: per base sequence content
Short read alignment methods (mappers)
http://wwwdev.ebi.ac.uk/fg/hts_mappers/
Fonseca N A et al. (2012) Bioinformatics 28:3169
Examples of common fast mappers
Two popular mappers
• Bowtie: http://bowtie-bio.sourceforge.net
• BWA: http://bio-bwa.sourceforge.net
Both are based on the Burrows-Wheeler Transform (BWT)
Trapnell & Salzberg (2009) Nature Biotech 27
Burrows-Wheeler transform makes the search for genome matches faster and more memory-efficient
BWT was originally introduced as a method for file compression (used in bzip2)
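To make the idea concrete, here is a toy sketch of the transform and of the backward search that counts exact matches of a read in a text. It is a deliberately naive illustration and not how Bowtie or BWA implement it: real mappers build the index with suffix-array algorithms and store sampled occurrence tables instead of recounting characters at every step. The function names are made up for this example.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    text += "$"  # sentinel; '$' sorts before letters in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_matches(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c]: number of characters in the text strictly smaller than c,
    # i.e. the row where rotations starting with c begin.
    first = {}
    total = 0
    for c in sorted(set(bwt_str)):
        first[c] = total
        total += bwt_str.count(c)
    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        # Occurrences of c before each boundary, recounted naively here.
        lo = first[c] + bwt_str[:lo].count(c)
        hi = first[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

genome = "GATTACATTA"  # hypothetical toy "genome"
index = bwt(genome)
print(index, count_matches(index, "TTA"))
```

The search touches the pattern one character at a time, from its last base to its first, and never scans the genome itself; that is the property that makes BWT-based mappers fast.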
Using BWA: example
• Create index for the genome:
bwa index [-a bwtsw|div|is] [-c] <in.fasta>
• -a STR BWT construction algorithm: bwtsw or is
• bwtsw for human-size genomes, is for smaller genomes
• Find suffix-array coordinates for the reads (produces a .sai file):
bwa aln [options] <prefix> <in.fq>
• Align each single-end FASTQ file individually
• Various options to change the alignment parameters/scoring matrix/seed length
• Using the .sai files produced by the aln step, produce the alignment
For paired-end reads:
bwa sampe [options] <prefix> <in1.sai> <in2.sai> <in1.fq> <in2.fq>
For unpaired reads:
bwa samse [-n max_occ] <prefix> <in.sai> <in.fq>
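The steps above can be chained for one paired-end sample. The sketch below only assembles the command lines as strings and does not run them; the file names are hypothetical, and it assumes that the index prefix is the reference FASTA path (the bwa index default) and that bwa aln and bwa sampe write to standard output, redirected here with >.

```python
import shlex

def bwa_paired_end_commands(ref, fq1, fq2, out_sam, algo="bwtsw"):
    """Return shell command strings for the bwa index/aln/sampe workflow."""
    q = shlex.quote  # protect file names containing spaces or shell characters
    sai1, sai2 = fq1 + ".sai", fq2 + ".sai"
    return [
        f"bwa index -a {algo} {q(ref)}",
        f"bwa aln {q(ref)} {q(fq1)} > {q(sai1)}",
        f"bwa aln {q(ref)} {q(fq2)} > {q(sai2)}",
        f"bwa sampe {q(ref)} {q(sai1)} {q(sai2)} {q(fq1)} {q(fq2)} > {q(out_sam)}",
    ]

# Hypothetical file names, for illustration only.
commands = bwa_paired_end_commands("genome.fa", "reads_1.fq", "reads_2.fq", "aln.sam")
for cmd in commands:
    print(cmd)
```

The index step is needed only once per genome; the aln and sampe steps are repeated for every sample.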
Not all reads are alignable
Sources of mismatches:
1. Sequencer miscalls
2. Actual differences between sample and reference:
(a) Genome variation (not the reference genome);
(b) Contamination (adapters, primers, different biological species) …
A good mapping rate is typically above 70-80%
SAM/BAM formats
• Recent addition: CRAM is a more compact and efficient binary version of SAM
SAM format information at SAMtools website
http://samtools.sourceforge.net
SAM format specifications
http://samtools.sourceforge.net
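Each SAM alignment line has 11 mandatory tab-separated fields, and the second one (FLAG) is a bit field. The sketch below splits a line, decodes the flag bits defined in the SAM specification, and estimates the fraction of mapped reads. The function names, the short bit labels, and the two example records are hypothetical; in practice tools such as samtools flagstat report these numbers.

```python
SAM_FLAGS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag: int):
    """Names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

def parse_sam_line(line: str) -> dict:
    """Split the 11 mandatory SAM fields of one alignment line."""
    f = line.rstrip("\n").split("\t")
    return {"qname": f[0], "flag": int(f[1]), "rname": f[2], "pos": int(f[3]),
            "mapq": int(f[4]), "cigar": f[5], "rnext": f[6], "pnext": int(f[7]),
            "tlen": int(f[8]), "seq": f[9], "qual": f[10], "tags": f[11:]}

def mapping_rate(lines) -> float:
    """Fraction of primary records without the 'unmapped' bit (0x4)."""
    flags = [parse_sam_line(l)["flag"] for l in lines if not l.startswith("@")]
    primary = [fl for fl in flags if not fl & (0x100 | 0x800)]
    return sum(not fl & 0x4 for fl in primary) / len(primary)

# Two hypothetical records: one properly paired read and one unmapped read.
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t99\tchr1\t100\t60\t7M\t=\t300\t207\tGATTACA\tIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(decode_flag(99), mapping_rate(sam))
```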
Tools to work with SAM/BAM alignments
Samtools - Sanger/C (http://samtools.sourceforge.net)
• Convert SAM <-> BAM
• Sort, index BAM files
• Flagstat - summary of the mapping flags
• Merge multiple BAM files
• Rmdup - remove PCR duplicates from the library preparation
Picard - Broad Institute/Java (http://picard.sourceforge.net)
• MarkDuplicates, CollectAlignmentSummaryMetrics, CreateSequenceDictionary, SamToFastq, MeanQualityByCycle, FixMateInformation…
Bio-SamTools - Perl (http://search.cpan.org/~lds/Bio-SamTools/)
Pysam - Python (http://code.google.com/p/pysam/)
…
Example: generate and manipulate alignment: commands Unix/Linux/MacOS…
http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Rsamtools
# extract specific region of genome (the BAM file must be sorted and indexed)
samtools view -b output.sorted.bam chr1:100000-110000 > myregion.bam
Viewing data in a rich context on the web: UCSC Genome Browser
Viewing data in a rich context on the web: Ensembl browser at EBI
Viewing data on your local machine: IGV (Integrative Genomics Viewer)
Viewing data on your local machine: IGB (Integrated Genome Browser)
http://bioviz.org/igb/
Wig (Wiggle) format: position-centric data
Yildirim et al. (2011) Nat Struct Mol Biol. 19
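As a small illustration of position-centric data, the sketch below writes per-base values as a fixedStep wiggle block: a header line with chromosome, 1-based start, and step, followed by one value per line. The function name and the coverage values are hypothetical.

```python
def to_fixed_step_wig(chrom, start, values, step=1):
    """Format per-position values as a fixedStep wiggle block (1-based start)."""
    header = f"fixedStep chrom={chrom} start={start} step={step}"
    return "\n".join([header] + [str(v) for v in values])

# Hypothetical read coverage over five consecutive bases on chr1.
wig_text = to_fixed_step_wig("chr1", 100001, [0, 2, 5, 5, 3])
print(wig_text)
```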
Storing just genomic intervals: low-resolution but economical
Coordinates are based only on one strand. Standard representation of intervals: start < end, even for the reverse strand.
Genomic feature (interval): peak, gene, exon, etc.
Chromosome, start, end, name, score (e.g. peak intensity), strand, …
Two de facto standards of coordinate system
GFF format (Sanger):
BED format (UCSC Browser):
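The two conventions differ in how the start is counted: GFF coordinates are 1-based and include both ends, while BED coordinates are 0-based and exclude the end position. A minimal sketch of the conversion follows, with helper names made up for this example. Off-by-one errors between the two systems are a common source of bugs.

```python
def bed_to_gff_coords(start: int, end: int):
    """BED (0-based, half-open) -> GFF (1-based, closed)."""
    return start + 1, end

def gff_to_bed_coords(start: int, end: int):
    """GFF (1-based, closed) -> BED (0-based, half-open)."""
    return start - 1, end

def bed_length(start: int, end: int) -> int:
    """Number of bases covered by a BED interval."""
    return end - start

def gff_length(start: int, end: int) -> int:
    """Number of bases covered by a GFF interval."""
    return end - start + 1

# The first 100 bases of a chromosome in both systems.
print(bed_to_gff_coords(0, 100), gff_to_bed_coords(1, 100))
```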
Working with genomic intervals: BedTools
http://code.google.com/p/bedtools
High-performance package operating on genomic intervals in various file formats: BED, GFF, VCF, SAM, BAM
Easy to download, install, and use in Unix/Linux/MacOS …
BedTools: examples of operations on genomic intervals
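Two typical operations are intersecting two sets of intervals and merging overlapping intervals within one set. The pure-Python sketch below shows the logic only, assuming BED-style 0-based, half-open coordinates. It is not the BedTools implementation, which uses much faster algorithms, and the peak and gene intervals are hypothetical.

```python
def intersect(a, b):
    """Overlapping parts of intervals in a with intervals in b.

    Each interval is (chrom, start, end). Quadratic-time sketch.
    """
    out = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # non-empty overlap
                out.append((chrom_a, start, end))
    return out

def merge(intervals):
    """Merge overlapping or directly adjacent intervals on each chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(end, merged[-1][2]))
        else:
            merged.append((chrom, start, end))
    return merged

# Hypothetical ChIP-seq peaks and one gene.
peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 10, 20)]
genes = [("chr1", 180, 250)]
print(intersect(peaks, genes))
print(merge(peaks))
```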
Galaxy http://galaxy.psu.edu/
Our Galaxy server at Molbio: http://galaga.mgh.harvard.edu/galaxy (can be accessed inside Partners network)