NGS Assembly and RNAseq Manpreet S. Katari

Katari NGS assembly rna seq - CGIAR · hpc.ilri.cgiar.org/.../Katari_NGS_assembly_rna_seq.pdf · Author: Manpreet Katari · Created Date: 8/28/2015 8:38:40 AM



NGS Assembly and RNA-seq

Manpreet S. Katari


Outline

• Fastq
  – File format widely used to provide sequence with a quality score for each base.

• Sequence assembly
  – What coverage is acceptable?

• RNA-seq
  – How to align the sequences?
  – What questions can we ask using RNA-seq?


Fastq format

[Figure: an example Fastq record with three labeled parts]

• Read Identifier
• Read Sequence
• Read Sequence Quality
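The three labeled parts of a Fastq record can be read with a few lines of code. The sketch below is illustrative only: it assumes the common four-line record layout (identifier, sequence, "+" separator, quality) and Phred+33 quality encoding, neither of which is stated on the slide, and the function name `parse_fastq` and the sample record are hypothetical.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, phred_scores) for each 4-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed Fastq record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(char) - 33 for char in qual]
        yield header[1:], seq, scores


# Hypothetical record, not taken from the slides.
example = ["@read1", "ACGT", "+", "II5!"]
parsed = list(parse_fastq(example))
print(parsed)
```

Each quality character maps to one base, which is why the sequence and quality lines must have the same length.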


Output of Bowtie Alignment (SAM)


Bowtie output (SAM)

1. HYYD8:00007:00087
2. 16
3. gb|CM000455
4. 1385117
5. 3
6. 29M1D9M1D9M2D21M2D18M1D70M
7. *
8. 0
9. 0
10. CAATGAGCTAACAACTGCAATGGGGCCATAATGGCTGCTTGTCGTTTGGCACGTACATGGACTAGCTTCCCCCGTGGCACAAAAATGGCTCTACGTTCTGTTACGAGCGCACCTACTGAAGGTCTCTCATAGGAGTGTATGTATATGCATATACAT
11. ;:=>>:333*33,33<<:7:3*344,444-449>>::4-6666B<EB>ABA@?;::44,4444<<4,4*555545-??670??==?<?@?>>>><7<<45-??>>?>>>??;<44444-5,:;;<776767-55?667?=@@888@AA@?<>;<55
12. AS:i:-58 XN:i:0 XM:i:4 XO:i:5 XG:i:7 NM:i:11 MD:Z:29^A9^T9^TG10C0T1G0A6^CC18^A70 YT:Z:UU XR:Z:@HYYD8%3A00007%3A00087%0AATGTATATGCATATACATACACTCCTATGAGAGACCTTCAGTAGGTGCGCTCGTAACAGAACGTAGAGCCATTTTTGTGCCACGGGGGAAGCTAGTCCATGTACGTGCCAAACGACAAGCAGCCATTATGGCCCCATTGCAGTTGTTAGCTCATTG%0A+%0A55<;><?@AA@888@@=?766?55-767677<;;%3A,5-44444<;??>>>?>>??-54<<7<>>>>?@?<?==??076??-545555*4,4<<4444,44%3A%3A;?@ABA>BE<B6666-4%3A%3A>>944-444,443*3%3A7%3A<<33,33*333%3A>>=%3A;%0A
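The numbered fields above correspond to the tab-separated columns of a SAM alignment line. As a rough sketch, the columns can be split and named as follows. The field names follow the SAM specification; the function name `parse_sam_line` is hypothetical, and the example line reuses the slide's first nine fields but replaces the sequence and quality with `*` (which SAM permits) and keeps only two of the optional tags for brevity.

```python
# Names of the 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line: str) -> dict:
    """Split one SAM alignment line into named fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]
    # Bit 0x10 of FLAG means the read aligned to the reverse strand;
    # bit 0x4 means the read is unmapped.
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    return record


# Abbreviated version of the slide's record (SEQ and QUAL replaced by "*").
line = ("HYYD8:00007:00087\t16\tgb|CM000455\t1385117\t3\t"
        "29M1D9M1D9M2D21M2D18M1D70M\t*\t0\t0\t*\t*\tAS:i:-58\tNM:i:11")
record = parse_sam_line(line)
print(record["QNAME"], record["POS"], record["is_reverse"])
```

With FLAG = 16 the read is reported on the reverse strand, which is consistent with the XR tag holding the reverse complement of field 10.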


CIGAR string

29M 1D 9M 1D 9M 2D 21M 2D 18M 1D 70M
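A CIGAR string is a run-length list of alignment operations (here M for aligned bases and D for bases deleted from the read relative to the reference). The following sketch, with hypothetical function names, splits the string into operations and totals how many read bases and reference bases the alignment covers, using the operation semantics from the SAM specification.

```python
import re

# One CIGAR operation: a count followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list:
    """Return the CIGAR string as a list of (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def cigar_lengths(cigar: str) -> tuple:
    """Return (read bases consumed, reference bases consumed)."""
    read_len = ref_len = 0
    for length, op in parse_cigar(cigar):
        if op in "MIS=X":   # operations that consume read bases
            read_len += length
        if op in "MDN=X":   # operations that consume reference bases
            ref_len += length
    return read_len, ref_len


slide_cigar = "29M1D9M1D9M2D21M2D18M1D70M"
print(parse_cigar(slide_cigar)[:3])   # [(29, 'M'), (1, 'D'), (9, 'M')]
print(cigar_lengths(slide_cigar))     # (156, 163)
```

For the slide's alignment, the 156 read bases span 163 reference positions because seven reference bases are deleted from the read, matching the XG:i:7 tag in the SAM record.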




Genome Assembly & Annotation


Whole-genome shotgun sequencing summary


Comparison of overlap graph and de Bruijn graph for assembly

Schatz et al., Genome Research 2010, "Assembly of large genomes using second-generation sequencing"
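To make the de Bruijn side of the comparison concrete, here is a minimal sketch of how such a graph can be built from reads: every k-mer contributes an edge from its first k-1 bases to its last k-1 bases. The function name, the toy reads, and the choice k = 3 are all illustrative assumptions. Real assemblers also track edge multiplicity, reverse complements, and sequencing errors, none of which is handled here. An overlap graph would instead use whole reads as nodes and pairwise read overlaps as edges.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy reads (hypothetical); they overlap on "GTC".
reads = ["ACGTC", "GTCA"]
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the edges AC, CG, GT, TC, CA spells out ACGTCA, the sequence both reads came from, without ever comparing the reads to each other directly.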


Using “pair-mate” reads to connect contigs


Finishing I

• Process of assembling raw sequence reads into accurate contiguous sequence
  – Required to achieve 1/10,000 accuracy

• Manual process
  – Look at sequence reads at positions where programs can’t tell which base is the correct one
  – Fill gaps
  – Ensure adequate coverage

[Figure labels: Gap; Single stranded]


Finishing II

• To fill gaps in sequence, design primers and sequence from primer

• To ensure adequate coverage, find regions where there is not sufficient coverage and use specific primers for those areas

[Figure labels: GAP; Primer; Primer]


Assembly Progression (Macro View)

Each nucleotide is sequenced many times
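How many times each nucleotide is sequenced on average is usually summarized as the coverage (depth). The sketch below is not from the slides: it assumes the standard idealization that reads land uniformly at random on the genome, so per-base depth is roughly Poisson-distributed (the Lander-Waterman model). Function names and the example numbers are hypothetical.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced: (reads x length) / genome size."""
    return num_reads * read_length / genome_size


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases hit by no read, assuming Poisson-distributed depth."""
    return math.exp(-coverage)


# Hypothetical run: one million 100-base reads over a 10 Mb genome.
depth = expected_coverage(1_000_000, 100, 10_000_000)
print(depth)                      # 10.0
print(fraction_uncovered(depth))  # expected share of bases with zero reads
```

Under these assumptions, the share of bases with no read at all shrinks exponentially as depth grows, which is one way to reason about the outline's question of what coverage is acceptable. Real data deviate from the model because of repeats and sequencing bias.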




Transcriptomics using RNA-seq


Questions that can be addressed with genome-wide expression analysis:

• What genes have similar function?

• What regulatory pathways exist?

• Can we subdivide experiments or genes into meaningful classes?

• Can we correctly classify an unknown experiment or gene into a known class?

• Can we make better treatment decisions for a cancer patient based on his or her gene expression profile?


RNA-seq provides even more


Candidate new and revised exons


Bowtie & TopHat


Normalizing the Data

• RPKM (Reads Per Kilobase of exon per Million mapped reads)

$$\text{Score} = \frac{R}{N \times T}$$

R = number of unique reads for the gene
N = size of the gene in kilobases (sum of exon lengths / 1,000)
T = total number of reads in the library mapped to the genome / 1,000,000

Recent studies show that it is not necessary to control for the size of genes, so most analyses control only for T.
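The RPKM score is simple enough to compute directly from the three quantities defined above. This is a minimal sketch; the function name `rpkm` and the example numbers are hypothetical.

```python
def rpkm(gene_reads: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon per million mapped reads: R / (N * T)."""
    n = exon_length_bp / 1000            # N: gene size in kilobases
    t = total_mapped_reads / 1_000_000   # T: mapped reads in millions
    return gene_reads / (n * t)


# Hypothetical gene: 500 reads, 2,000 bp of exons, 10 million mapped reads.
score = rpkm(500, 2000, 10_000_000)
print(score)  # 25.0
```

Dropping the division by N gives reads per million, which is the library-size-only normalization described in the last sentence of the slide.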


Test using a negative binomial model [glm.nb()]

[Figure: two panels, each comparing groups x and y on an axis running from 0 to 1000. Left panel: p-value = 0.258. Right panel: p-value = 1.03e-05.]


Volcano plot: fold-change vs. significance

[Figure: volcano plot. x-axis: log ratio; y-axis: -log(p-value); marked significance levels at p = 10^-2, p = 10^-3, and p = 10^-18.]
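Each gene in a volcano plot is one point whose x-coordinate is the log ratio between conditions and whose y-coordinate is the negative log of its p-value. The sketch below computes those coordinates. The slide does not say which log bases are used, so base 2 for the ratio and base 10 for the p-value are assumptions here, as are the function name and the example values.

```python
import math


def volcano_point(mean_x: float, mean_y: float, p_value: float) -> tuple:
    """Return (log2 fold change of y over x, -log10 p-value) for one gene."""
    log_ratio = math.log2(mean_y / mean_x)
    significance = -math.log10(p_value)
    return log_ratio, significance


# Hypothetical gene: expression rises from 100 to 400 with p = 0.001.
point = volcano_point(100.0, 400.0, 0.001)
print(point)
```

Genes that are both strongly changed and highly significant end up in the upper left and upper right corners of the plot.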


Clustering (genes)

Genes are p points in a T-dimensional space (p = # of genes, T = # of conditions). Clustering is based on the proximity of the points:

• Extract a few typical expression patterns (cluster centroids)
• Partition genes based on their profile similarity (clusters, memberships)

Genes with similar expression profiles are likely to have common or related functions, and possibly to be co-regulated.

[Figure: genes plotted as points in expression space, T = 3]

Similarly, conditions can be classified into different groups based on similarities in their expression profiles (across all genes or subsets of genes).


Hierarchical Clustering

• Find the pair(s) with the highest pairwise similarity
• Join these as a group and calculate an “average” profile (single, average, or complete linkage)
• Iteratively join groups until all are linked

This example illustrates single-linkage clustering in Euclidean space on 6 points.

[Figure: six points (A, B, C, D, E, F) in the plane and the resulting dendrogram with leaves A, B, C, D, E, F]

The UPGMA method of phylogenetic reconstruction uses average linkage …
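The joining procedure can be sketched in a few lines for the single-linkage case, where the distance between two groups is the smallest Euclidean distance between any pair of their members. This is an unoptimized illustration, not the method used to produce the slide's figure. The function names and the three example points are hypothetical.

```python
import math
from itertools import combinations


def cluster_distance(points, a, b):
    """Single linkage: smallest Euclidean distance between members of a and b."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def single_linkage(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(points, *pair))
        merges.append((set(a), set(b), cluster_distance(points, a, b)))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical coordinates for three of the labeled points.
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0)}
merges = single_linkage(points)
for left, right, distance in merges:
    print(sorted(left), sorted(right), distance)
```

The merge history is exactly what a dendrogram draws: each merge becomes a branch point at the height of its distance. Swapping `min` for `max` or a mean in `cluster_distance` would give complete or average linkage.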


End Result

Genes are grouped according to similarities in their expression levels across a variety of conditions.

[Figure: clustered expression matrix; axes: Conditions, Genes (clustered by similarity in expression profiles)]

• Place genes with similar expression profiles into clusters.

• Similarity is defined by Pearson correlation.
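Pearson correlation between two expression profiles can be computed directly from its definition. The sketch below uses only the standard library; the function name and the example profiles are hypothetical, and a profile with zero variance would raise a ZeroDivisionError.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of conditions")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical expression profiles across three conditions.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]   # same shape as gene_a, twice the magnitude
gene_c = [3.0, 2.0, 1.0]   # opposite trend
print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```

Because the correlation ignores scale, two genes with the same pattern but different absolute expression levels are treated as similar, which is often the desired behaviour when grouping co-regulated genes.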


Gene Set Enrichment

• Often, when we characterize this list of genes, we use statistics to show that a property or annotation is significantly over-represented compared with a list created at random.

• Two of the common statistical methods are:
  – Hypergeometric test
  – Fisher’s exact test


Gene Ontology

• “The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases.”

• “The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.”
  – “A gene product might be associated with or located in one or more cellular components; it is active in one or more biological processes, during which it performs one or more molecular functions.”


GO is a directed acyclic graph


Hypergeometric Test

• The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.

• Think of an urn with two types of marbles, blue and red, where blue is success and red is failure. Stand next to the urn with your eyes closed and select 10 marbles without replacement. What is the probability that 4 of the 10 will be blue?
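The urn question can be answered by counting combinations. The slide does not say how many marbles of each colour the urn holds, so the 20 blue and 30 red marbles below are purely hypothetical, as are the function names. The second function gives the upper-tail probability that is typically reported as the p-value in an over-representation test.

```python
from math import comb


def hypergeom_pmf(k: int, population: int, successes: int, draws: int) -> float:
    """P(exactly k successes) when drawing without replacement."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def hypergeom_upper_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(at least k successes): the over-representation p-value."""
    return sum(hypergeom_pmf(i, population, successes, draws)
               for i in range(k, min(successes, draws) + 1))


# Hypothetical urn: 20 blue (success) and 30 red marbles; draw 10.
p_exactly_4 = hypergeom_pmf(4, population=50, successes=20, draws=10)
p_at_least_4 = hypergeom_upper_tail(4, population=50, successes=20, draws=10)
print(p_exactly_4, p_at_least_4)
```

In the gene set setting, the urn is the genome, blue marbles are genes carrying a given annotation, and the draw is the gene list being characterized.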