ismb 2010 amitra - University of California, Riverside · alumni.cs.ucr.edu/~amitra/bioinfo/ismb_amitra_2010.pdf

Introduction to Next Generation Sequencing

1.5 billion bases / day

100 ng DNA for library

Motivation

• Next generation sequencing (NGS) is rapidly becoming a method of choice for many whole-genome studies, especially for identifying protein-DNA interactions (ChIP-Seq).

• NGS technology is relatively new, and the properties of the sequencing background are not yet well understood.

• We need to identify the cause of the uneven distribution of sequenced DNA.

Motivation: Uneven Distribution of Sequenced DNA

Source: Various S. cerevisiae sequences from NCBI SRA. Plotted with UCSC Genome Browser.

Plots of genomic and input DNA of S. cerevisiae, chromosome IV, from different experiments.

R: Genomic DNA. K: Input DNA Exp. 1. B: Input DNA Exp. 2.

Motivation: Accuracy of Peak Calling?

“True peaks”, corresponding to the actual protein-DNA binding sites (in red), are difficult to distinguish from spurious peaks found in the background (in green).

PolII binding profile

Control / background profile

Example of ChIP-Seq data: Lefrançois et al. (2009), Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing, BMC Genomics.

Modeling of Sequence Background

We anticipate that modeling the sequencing background as a one-point Poisson distribution, which lies at the core of almost all approaches to date, will lead to significant systematic errors in the interpretation of ChIP-Seq experiments.

Modeling of Sequence Background

• Global / Local Poisson Model

• Are sequence reads distributed according to a Poisson distribution along the chromosomes (global) or locally within sliding windows (local)?

• Formula for Poisson Distribution.

• Lambda = average sequence read density for the whole genome. [Total reads / mappable bases]
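The formula itself did not survive in this transcript. As a minimal sketch, assuming the slide showed the standard Poisson probability mass function P(X = k) = lambda^k * exp(-lambda) / k!, the global model can be written as follows. The function names and the read and base counts are hypothetical and chosen only for illustration.

```python
import math


def read_density(total_reads, mappable_bases):
    """Lambda: average number of read starts per mappable base."""
    if mappable_bases <= 0:
        raise ValueError("mappable_bases must be positive")
    return total_reads / mappable_bases


def poisson_pmf(k, lam):
    """P(X = k) = lam^k * exp(-lam) / k!, evaluated in log space for stability."""
    if k < 0:
        return 0.0
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


# Hypothetical numbers, not taken from the slides.
lam = read_density(total_reads=2_000_000, mappable_bases=10_000_000)
for k in range(4):
    # Expected fraction of mappable positions carrying exactly k read starts.
    print(k, poisson_pmf(k, lam))
```

For the local variant, the same function would be applied with lambda estimated inside each sliding window rather than over the whole genome.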

Method

• Identify mappable positions, 35 bp long (the length depends on the sequence read length), in the genome of Saccharomyces cerevisiae.

• Input DNA reads were mapped to the genome.

• A Poisson model was simulated with the same number of reads.

• The Poisson model was compared with the actual background.
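The slides do not give the simulation procedure. One plausible reading, sketched here under that assumption, is to drop the same number of reads uniformly at random onto the mappable positions, which yields approximately Poisson-distributed counts per position. The function names and the toy genome size are hypothetical.

```python
import random
from collections import Counter


def simulate_background(num_reads, mappable_positions, seed=0):
    """Place num_reads read starts uniformly at random on the mappable positions.

    Returns the number of simulated reads at each position, in input order.
    """
    rng = random.Random(seed)
    hits = Counter(rng.choice(mappable_positions) for _ in range(num_reads))
    return [hits.get(pos, 0) for pos in mappable_positions]


def count_histogram(per_position_counts):
    """Map reads-per-position -> number of positions with that many reads."""
    return Counter(per_position_counts)


# Toy example: 10,000 mappable positions and 2,000 reads (lambda = 0.2).
mappable = list(range(10_000))
simulated = simulate_background(2_000, mappable)
print(sorted(count_histogram(simulated).items()))
```

The same histogram computed from the mapped input DNA reads can then be set against the simulated one.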

Results

• We compare the fifth-order moments of the simulation and the actual ChIP-Seq reads.

• The p-value turns out to be 2.36e-9.

• Conclusion: Sequence reads are not distributed over the Saccharomyces cerevisiae genome according to a Poisson distribution.

• Conclusion: With a 500 bp sliding window, in ~80% of the windows the reads are not distributed according to a Poisson distribution.
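The slides state neither whether raw or central moments were used nor which test produced the p-value or the per-window result. The sketch below therefore only illustrates the kind of comparison involved, assuming raw moments. It uses the known fifth raw moment of a Poisson variable and, as a stand-in for the unspecified window test, the variance-to-mean ratio, which is close to 1 for Poisson counts. All function names and the toy counts are hypothetical.

```python
def fifth_raw_moment(counts):
    """Sample estimate of E[X^5] from per-position read counts."""
    return sum(c ** 5 for c in counts) / len(counts)


def poisson_fifth_raw_moment(lam):
    """E[X^5] for X ~ Poisson(lam); coefficients are Stirling numbers S(5, k)."""
    return lam + 15 * lam**2 + 25 * lam**3 + 10 * lam**4 + lam**5


def dispersion_index(counts):
    """Variance-to-mean ratio; returns None when the window has no reads."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        return None
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return variance / mean


def windowed_dispersion(counts, window=500, step=500):
    """Dispersion index for each full window of the given size."""
    return [dispersion_index(counts[i:i + window])
            for i in range(0, len(counts) - window + 1, step)]


# Toy counts: clumped reads give a fifth moment far above the Poisson value.
observed = [0] * 990 + [20] * 10
lam = sum(observed) / len(observed)
print(fifth_raw_moment(observed), poisson_fifth_raw_moment(lam))
print(windowed_dispersion(observed))
```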

• The physical properties of DNA in solution change, allowing it to bend more at certain places.

• Bendability of DNA segments has been quantified by hydroxyl radical cleavage intensity.

• Prior to sequencing, DNA usually undergoes mechanical shearing by means of sonication.

• Some reads are more prone to being sequenced than others.

•  Example:    ‘TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAA’  

‘TTTTTTTTTTTTTTTTTAATTGAACAATAGATGC’  

DNA  Bendability  

http://dna.bu.edu/orchid/ (Tullius and Greenbaum)

Method

• We propose an alternative method of background characterization which is non-local and takes into account intrinsic properties of DNA that correlate with the density of reads.

• We compute the weighted average of bendability, as well as of GC content, at sequence start sites and use these as models to identify sites on the genome which may be preferentially sequenced in the given experiment.

• Using log-likelihood, we identify the best model that can predict sequencing bias.
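The weighting scheme is not spelled out in the slides. A minimal sketch, assuming each start site is weighted by the number of reads beginning there, computes the average GC content at every offset around the read starts. The function name, the toy genome, and the read counts are hypothetical. A bendability profile could be built the same way by averaging a per-base numeric track instead of a G/C indicator.

```python
def gc_profile(genome, start_counts, flank=100):
    """Read-count-weighted GC fraction at each offset in [-flank, +flank].

    genome:       DNA sequence as a string
    start_counts: dict mapping read start position -> number of reads there
    """
    size = 2 * flank + 1
    gc_sum = [0.0] * size
    weight = [0.0] * size
    for start, n_reads in start_counts.items():
        for j in range(size):
            pos = start - flank + j
            if 0 <= pos < len(genome):  # skip offsets that fall off the sequence
                weight[j] += n_reads
                if genome[pos] in "GCgc":
                    gc_sum[j] += n_reads
    return [g / w if w else float("nan") for g, w in zip(gc_sum, weight)]


# Toy example: three reads start at position 4, one read at position 7.
print(gc_profile("ATGCGCATTA", {4: 3, 7: 1}, flank=1))
```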

Bendability at Sequence Read Locations

GC Content at Sequence Read Locations

Method and Conclusion

• Use log-likelihood to determine whether bendability or GC content is a better model than a random Poisson model.

• Formula for the likelihood function.

• We compute the sum of the logarithms of the likelihood values for three probability distributions, one each for bendability, GC content, and the random model.

• We conclude that the bendability model serves as a useful predictor of bias in sequencing experiments within a +/- 100 bp offset from the beginning of the sequence; beyond that, GC content works better.
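The likelihood formula is missing from this transcript. The sketch below assumes a simple form: each model assigns every position a probability of being a read start, and the log-likelihood is the sum of log-probabilities over all observed reads. The function names, the per-position weights, and the read counts are hypothetical toy values, not results from the talk.

```python
import math


def normalize(weights):
    """Turn non-negative per-position weights into probabilities summing to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(read_counts, probs):
    """Sum of log-probabilities over reads: sum_i n_i * log(p_i)."""
    ll = 0.0
    for n, p in zip(read_counts, probs):
        if n:
            if p <= 0:
                return float("-inf")  # the model rules out an observed read
            ll += n * math.log(p)
    return ll


# Toy data: reads observed at four positions, and three candidate models.
read_counts = [5, 0, 1, 4]
models = {
    "random": normalize([1, 1, 1, 1]),        # uniform background
    "gc content": normalize([4, 1, 1, 4]),    # hypothetical weights
    "bendability": normalize([5, 1, 1, 3]),   # hypothetical weights
}
scores = {name: log_likelihood(read_counts, p) for name, p in models.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

A higher (less negative) log-likelihood indicates a model that explains the observed read starts better, which is how the per-dataset Poisson values reported on the results slides would be compared with the bendability and GC content models.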

Results for Genomic DNA

The log-likelihood for the Poisson model was -1.517e+6.

Results for Input DNA Exp. 1

The log-likelihood for the Poisson model was -2.507e+6.

Results for Input DNA Exp. 2

The log-likelihood for the Poisson model was -5.945e+6.

• Next generation sequencing (NGS) is rapidly becoming a method of choice for many whole-genome studies, especially for identifying protein-DNA interactions (ChIP-Seq). NGS technology is relatively new, and the properties of the sequencing background are not yet well understood. It has been shown [1] that the sequencing background is highly repeatable and thus, in principle, can be modeled with high accuracy.

• We demonstrate that modeling the sequencing background as a one-point Poisson distribution, which lies at the core of almost all approaches to date, will lead to significant systematic errors in the interpretation of ChIP-Seq experiments. We propose an alternative method of background characterization which is non-local and takes into account intrinsic properties of DNA that correlate with the density of reads.

• By comparing ChIP-Seq backgrounds obtained under the same conditions but using different experimental protocols, we identify and computationally separate the bias introduced into the results at the different stages of the sample preparation and sequencing process. We discuss how this understanding can be used to improve detection of genuine protein-DNA interactions and provide software tools that implement our approach. Finally, we propose an algorithm based on a multiparametric model, which can model the sequencing background ab initio.

• [1] P. Lefrançois et al.: “Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing.” (2009) BMC Genomics 10: 37.

Next Generation Sequencing

• Next generation sequencing (NGS) is rapidly becoming a method of choice for many whole-genome studies, especially for identifying protein-DNA interactions (ChIP-Seq).

• NGS technology is relatively new, and the properties of the sequencing background are not yet well understood.