Introduction to Bioinformatics of next-generation sequencing

Sequence acquisition and processing; genome mapping and alignment manipulation

Ruslan Sadreyev

Director of Bioinformatics
Department of Molecular Biology, MGH
Department of Pathology, MGH, HMS

Next-Generation Sequencing Core and Bioinformatics group at Molecular Biology

5/18/17 NGS analysis 1

sadreyev@molbio.mgh.harvard.edu

Next-generation sequencing core
Website: nextgen.mgh.harvard.edu
Email: nextgen@research.mgh.harvard.edu

Bioinformatics team
Website: molbio.mgh.harvard.edu/department/bioinformatics
Email: bioinfo@molbio.mgh.harvard.edu

Bring your DNA/RNA!

Bring your data!

Founding members of MGH Bioinformatics Consortium

• Analytic and Translational Genetics Unit
• Biomedical Informatics Core
• Center for Integrated Diagnostics Bioinformatics Group
• Bioinformatics at the MGH Cancer Center and Department of Pathology
• Genomics and Technology Core
• ITN/PHS Information Systems/Immune Tolerance Network
• Bioinformatics group at the Department of Molecular Biology
• MGH next-generation sequencing core

Missions of Bioinformatics Consortium

Support and collaboration

Develop and support the informatics component of fundamental, translational, and clinical research projects

Consulting

Plan experimental design and execution of data generation and analysis; advise on best practices

Educational outreach

Education in general Bioinformatics concepts and methods; helping researchers think about their data in quantitatively rigorous terms

Training

Teach hands-on computational skills in statistics and Bioinformatics workflows

About this mini-course

You can
(a) get oriented in basic NGS Bioinformatics concepts, approaches, and tools
(b) start asking the right questions about your data

You cannot get
(a) very deep coverage of a specific area/method/application
(b) hands-on computational experience

Two seminars:

Sequence acquisition and processing; genome mapping and alignment manipulation
Thurs May 18 at 2 pm

Specific NGS applications and public datasets
Thurs May 25 at 2 pm

I used slides/images/material by …

BF Francis Ouellette, Istvan Albert, Mik Black & Christin Print, Thomas Keane, Illumina

A variety of existing NGS technologies

Metzker (2010) Nature Rev Genetics 11

Sequencing by synthesis (SBS)

Metzker (2010) Nature Rev Genetics 11

Paired-end sequencing: looking at both ends of the fragment

Paired-end sequencing: better mapping to genome

© Illumina, Inc

Indexing (barcoding) of multiple samples in a single lane

Major NGS applications: examples

• Whole Genome Shotgun Sequencing (WGS)
• Targeted/exome sequencing
• RNA-seq
• ChIP-seq
• Metagenomics (targeted region / whole genome sequencing)
• MANY more

Basic workflow

From: Thomas Keane

Randomly shear DNA + end repair + size select

Append sequencing adapters

Layout of library on sequencing slide or wells (e.g. C-Bot)

For each library fragment, determine the first N bases at one or both ends of the fragment

Image processing + base calling -> bases and quality (FASTQ)

Illumina HiSeq

From http://www.qbi.uq.edu.au

Our current output: 8 lanes, 150-200 M reads per lane

Illumina MiSeq: desktop device

Our current output: 1 lane, 10-20 M reads per lane

• Fast
• Flexible (can do longer reads, up to 500 bp)
• ~10x fewer reads than HiSeq
• Cheaper per run (but not per read)

Applications:
• Amplicon sequencing
• QC before large-scale runs
• Bacterial genomes
• …

www.illumina.com

Bioinformatics tools

From: Istvan Albert

Base quality calibration

Phred score: a measure of p = Prob(erroneous base call): Q = -10 log10(p)

Q10 = 1 in 10 chance of incorrect base call
Q20 = 1 in 100 chance of incorrect base call
Q30 = 1 in 1000 chance of incorrect base call

Rule of thumb: data are not useful if < Q20
Standard assessment: proportion of bases with score ≥ Q30
Highest Phred scores are typically around Q35-40
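
As a rough illustration of the Phred scale, the sketch below converts between quality scores and error probabilities and computes the fraction of bases at or above Q30. The function names and the example quality values are invented for this illustration and do not come from any sequencing toolkit.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to the error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p to a Phred score Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def fraction_at_least(quals, threshold=30):
    """Fraction of base calls whose quality is >= threshold (e.g. Q30)."""
    return sum(q >= threshold for q in quals) / len(quals)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q}: p = {phred_to_error_prob(q):g}")
    # Hypothetical per-base qualities of one short read
    print(fraction_at_least([35, 38, 12, 30]))
```

The same arithmetic underlies the "proportion of bases with score ≥ Q30" summary that is commonly reported for a sequencing run.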

FASTQ format for NGS sequences

http://en.wikipedia.org/wiki/FASTQ_format
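
To make the FASTQ layout concrete, here is a minimal reader sketch. It assumes the common four-lines-per-record layout (header starting with "@", sequence, "+" separator, quality string) and the Phred+33 quality encoding; the read shown is made up, and the function names are illustrative only.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def decode_qualities(qual, offset=33):
    """Turn a quality string into Phred scores (assumes Phred+33 encoding)."""
    return [ord(c) - offset for c in qual]


# Hypothetical single-read FASTQ content
example = "@read1\nGATTACA\n+\nIIIII5#\n"

if __name__ == "__main__":
    for name, seq, qual in read_fastq(io.StringIO(example)):
        print(name, seq, decode_qualities(qual))
```

Older data may use a different quality offset, so the `offset` parameter should be checked against how the file was produced.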

Quality control of raw sequences: FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Good example: per base sequence quality
http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Good example: per base sequence content
http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Bad example: per base sequence quality
http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Bad example: per base sequence content
http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Mapping your reads

Short read alignment methods (mappers)
http://wwwdev.ebi.ac.uk/fg/hts_mappers/

Fonseca N A et al. (2012) Bioinformatics 28:3169

Examples of common fast mappers

Two popular mappers
• Bowtie: http://bowtie-bio.sourceforge.net
• BWA: http://bio-bwa.sourceforge.net

Both are based on Burrows-Wheeler Transform (BWT)

Trapnell & Salzberg (2009) Nature Biotech 27

Burrows-Wheeler transform makes the search for genome matches faster and more memory-efficient

BWT was originally introduced as a method for file compression (used in bzip2)
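
The idea can be sketched on a toy string. The code below builds the BWT by sorting all rotations and then counts exact matches of a pattern with backward search. It is a teaching sketch only: the function names are invented here, and real mappers such as Bowtie and BWA rely on compressed indexes and far more efficient construction than sorting rotations.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    if "$" in text:
        raise ValueError("'$' is reserved as the end-of-text marker")
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_matches(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern in the original text using only its BWT."""
    # c_table[ch]: number of characters in the text that sort before ch
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_str[:i]; a real index precomputes these counts.
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)  # current range of sorted rotations
    for ch in reversed(pattern):  # extend the match one base at a time, right to left
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(ch, lo)
        hi = c_table[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed)
    print(count_matches(transformed, "TA"))
```

Each step of the search narrows a range of sorted rotations instead of scanning the genome, which is where the speed and memory benefits come from.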

Using BWA: example

• Create index for the genome:
    bwa index [-a bwtsw|div|is] [-c] <in.fasta>
  • -a STR: BWT construction algorithm: bwtsw or is
  • bwtsw for human-size genomes, is for smaller genomes

• Find the suffix array coordinates of the reads (written to a .sai file):
    bwa aln [options] <prefix> <in.fq>
  • Align each single-end fastq file individually
  • Various options to change the alignment parameters/scoring matrix/seed length

• Using the sai files produced by the aln step, produce the alignment in SAM format

  For paired-end reads:
    bwa sampe [options] <prefix> <in1.sai> <in2.sai> <in1.fq> <in2.fq>

  For unpaired reads:
    bwa samse [-n max_occ] <prefix> <in.sai> <in.fq>

Not all reads are alignable

Sources of mismatches:
1. Sequencer miscalls
2. Actual differences between sample and reference:
   (a) Genome variation (not the reference genome);
   (b) Contamination (adapters, primers, different biological species)
…

Typical good mappability rate > 70%-80%

SAM/BAM formats

•  Recent addition: CRAM is a more compact and efficient binary version of SAM

SAM/BAM formats

SAM format information at SAMtools website
http://samtools.sourceforge.net

Example: Two lines of a SAM file

SAM format specifications
http://samtools.sourceforge.net
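
As a small sketch of how the format is laid out, the code below splits one alignment line into the eleven mandatory tab-separated SAM fields and decodes the bitwise FLAG. The alignment record is hypothetical, and the class and function names are invented for this example; for real work a library such as Pysam is the usual choice.

```python
from typing import NamedTuple

# Bit values of the SAM FLAG field and short labels for them
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int  # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one alignment line (optional tags are ignored)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def decode_flag(flag: int):
    """List the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# Hypothetical alignment line, for illustration only
example = "read1\t99\tchr1\t100\t60\t7M\t=\t250\t157\tGATTACA\tIIIIIII"

if __name__ == "__main__":
    rec = parse_sam_line(example)
    print(rec.rname, rec.pos, rec.cigar, decode_flag(rec.flag))
```

Header lines (those starting with "@") are not handled here and would need to be skipped before parsing.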

Tools to work with SAM/BAM alignments

Samtools - Sanger/C (http://samtools.sourceforge.net)
• Convert SAM <-> BAM
• Sort and index BAM files
• Flagstat - summary of the mapping flags
• Merge multiple BAM files
• Rmdup - remove PCR duplicates from the library preparation

Picard - Broad Institute/Java (http://picard.sourceforge.net)
• MarkDuplicates, CollectAlignmentSummaryMetrics, CreateSequenceDictionary, SamToFastq, MeanQualityByCycle, FixMateInformation, …

Bio-SamTools - Perl (http://search.cpan.org/~lds/Bio-SamTools/)

Pysam - Python (http://code.google.com/p/pysam/)

…

SAMTools

Example: generate and manipulate alignment: commands Unix/Linux/MacOS…

http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Rsamtools

# extract a specific region of the genome (the BAM file must be sorted and indexed)
samtools view -b output.sorted.bam chr1:100000-110000 > myregion.bam

Visualizing your data

Viewing data in a rich context on the web: UCSC Genome Browser

Viewing data in a rich context on the web: Ensembl browser at EBI

Viewing data on your local machine: IGV (Integrative Genomics Viewer)

Viewing data on your local machine: IGV

Viewing data on your local machine: IGB (Integrated Genome Browser)

http://bioviz.org/igb/

Wig (Wiggle) format: position-centric data

Yildirim et al. (2011) Nat Struct Mol Biol. 19
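
To show what position-centric data looks like in practice, the sketch below computes per-base read coverage over a small region and writes it as a fixedStep wiggle block. The read intervals are hypothetical, the function name is invented for this example, and the input is assumed to use 0-based, half-open coordinates, while wiggle positions are 1-based.

```python
def coverage_to_fixedstep(chrom, region_start, region_end, intervals):
    """Return fixedStep wiggle text with per-base coverage of [region_start, region_end).

    intervals: iterable of (start, end) pairs in 0-based, half-open coordinates.
    """
    depth = [0] * (region_end - region_start)
    for start, end in intervals:
        # Clip each interval to the region, then count it at every covered base
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    # Wiggle positions are 1-based, so the 0-based region start is shifted by one
    header = f"fixedStep chrom={chrom} start={region_start + 1} step=1"
    return "\n".join([header] + [str(d) for d in depth])


# Hypothetical aligned read intervals on chr1
reads = [(100, 104), (102, 106)]

if __name__ == "__main__":
    print(coverage_to_fixedstep("chr1", 100, 106, reads))
```

One value per position gives full resolution, at the cost of file size for large genomes.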

Storing just genomic intervals: low-resolution but economic

Coordinates are based only on one strand. Standard representation of intervals: start < end, even for the reverse strand.

Genomic feature (interval): peak, gene, exon, etc.
Chromosome   start   end   name   score (e.g. peak intensity)   strand   …

Two de facto standards of coordinate system

GFF format (Sanger): 1-based start, end included
BED format (UCSC Browser): 0-based start, end excluded
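
Mixing up the two conventions is a common source of off-by-one errors: GFF uses 1-based coordinates with the end included, while BED uses a 0-based start with the end excluded. The helper functions below are a minimal sketch of converting between them; their names are invented for this example.

```python
def bed_to_gff(start: int, end: int):
    """Convert a BED interval (0-based, end excluded) to GFF (1-based, end included)."""
    return start + 1, end


def gff_to_bed(start: int, end: int):
    """Convert a GFF interval (1-based, end included) to BED (0-based, end excluded)."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Number of bases covered by a BED interval."""
    return end - start


def gff_length(start: int, end: int) -> int:
    """Number of bases covered by a GFF interval."""
    return end - start + 1


if __name__ == "__main__":
    # The first 100 bases of a chromosome in both conventions
    print(bed_to_gff(0, 100))
    print(gff_to_bed(1, 100))
```

Note that only the start coordinate changes in the conversion, and that the interval length formula differs between the two systems.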

Working with genomic intervals: BedTools

http://code.google.com/p/bedtools

High-performance package operating on genomic intervals in various file formats: BED, GFF, VCF, SAM, BAM

Easy to download, install, and use in Unix/Linux/MacOS …

BedTools: choice of many operations

BedTools: examples of operations on genomic intervals
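
One of the most common interval operations is finding overlaps between two sets of features. The sketch below is a plain-Python illustration of that idea, similar in spirit to an intersect operation but not a reimplementation of BedTools; it assumes BED-style coordinates (0-based start, end excluded), uses made-up intervals, and compares every pair, which is only practical for small inputs.

```python
def intersect(a, b):
    """Return the overlapping portions between intervals in a and intervals in b.

    Each interval is a (chrom, start, end) tuple in BED-style coordinates.
    """
    hits = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue  # intervals on different chromosomes never overlap
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # non-empty overlap
                hits.append((chrom_a, start, end))
    return hits


# Hypothetical features, e.g. peaks (a) and genes (b)
peaks = [("chr1", 100, 200), ("chr1", 300, 400)]
genes = [("chr1", 150, 350), ("chr2", 100, 200)]

if __name__ == "__main__":
    for hit in intersect(peaks, genes):
        print(*hit, sep="\t")
```

Dedicated tools sort or index the intervals first so that they avoid the all-against-all comparison used here.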

Galaxy
http://galaxy.psu.edu/

Galaxy
http://galaxy.psu.edu/

Our Galaxy server at Molbio: http://galaga.mgh.harvard.edu/galaxy (can be accessed inside Partners network)

Galaxy

From: Mik Black & Christin Print

Galaxy

From: Mik Black & Christin Print

Galaxy tools

Galaxy tools

Schedule

Sequence acquisition and processing; genome mapping and alignment manipulation
Thurs May 18 at 2 pm

Specific NGS applications and public datasets
Thurs May 25 at 2 pm
