Xander – Gene Targeted Metagenomic Assembler. Qiong Wang, Center for Microbial Ecology, Dept. of Plant, Soil and Microbial Sciences, Michigan State University. June 4th, 2015



  • Genome Assembly

    •  Genome assembly
       – Repeats are a major problem
       – Short reads with different error profiles

    •  Metagenomic bulk assembly
       – Assumes uniform abundance, which is not true in metagenomic data
       – Metagenomes are highly diverse
       – Big data: space and time complexity make it necessary to discard low-abundance reads before assembly

  • Profile Hidden Markov Models

    •  Widely used in many fields
       – e.g. voice recognition

    •  Protein and nucleic acid HMMs
       – Powerful gene search and assignment tool
       – Multiple sequence alignments

    •  Probabilistic models on a linear system that change state according to a transition rule

    •  The next state depends only on the current state, independent of any other states

    •  A profile HMM has three state types (match, insert, delete), 7 transitions between states, and transition and emission probabilities

  • Protein Profile Hidden Markov Model

    Add insert states for extra residues

    [Figure: profile HMM showing insert, match and delete states, with example residues I, L, R, K, V and a gap (−)]

  • Xander: Gene-Targeted Assembler Combining de Bruijn Graph and HMM

    [Figure: a de Bruijn graph (kmers gagccg, ccggga, ccgagc) and a profile hidden Markov model (match, insert and delete states M, I, D at positions 56 to 58) are merged into the Xander combined weighted assembly graph. Each combined node pairs an HMM state and position with a kmer: M 57 gag ccg; I 57 ccg gga; M 58 ccg gga; I 57 ccg agc; M 58 ccg agc; D 58 gag ccg.]

    Wang et al., 2015, Xander: Employing a Novel Method for Efficient Gene-Targeted Metagenomic Assembly. In revision
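The merge shown in the figure can be sketched as follows. This is a minimal Python illustration, not Xander's implementation: the function names are hypothetical, the edge weights (HMM transition and emission scores) are left out, and it assumes that a match or insert step consumes one codon (three nucleotides) while a delete step consumes none, as the node labels on the slide suggest.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def codon_successors(graph, kmer):
    """K-mers reached by following three single-nucleotide edges (one codon)."""
    frontier = {kmer}
    for _ in range(3):
        frontier = {nxt for node in frontier for nxt in graph.get(node, ())}
    return frontier


def combined_successors(graph, state, pos, kmer):
    """Successor nodes (HMM state, model position, k-mer) in the combined graph.

    Assumed transitions: M->M/I/D, I->M/I, D->M/D (seven in total).
    """
    next_kmers = codon_successors(graph, kmer)
    out = [("M", pos + 1, km) for km in next_kmers]   # match: next position, new codon
    if state != "D":
        out += [("I", pos, km) for km in next_kmers]  # insert: same position, new codon
    if state != "I":
        out.append(("D", pos + 1, kmer))              # delete: next position, same k-mer
    return sorted(out)


reads = ["gagccggga", "gagccgagc"]
graph = build_de_bruijn(reads, k=6)
for node in combined_successors(graph, "M", 57, "gagccg"):
    print(node)
```

With the two toy reads above, the successors of M 57 gagccg are the five neighbouring nodes listed in the figure.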

  • HMM-Guided Graph Search

  • HMP Defined Community

    http://rdp.cme.msu.edu

    Organism Name                          Strain                Accession Number
    Streptococcus mutans                   NN2025 DNA            NC_013928 (AP010655) †
    Listeria monocytogenes                 L99 serovar 4a        NC_003210 (FM211688) †
    Acinetobacter baumannii                ATCC 17978            NC_009085.1 (CP000521)
    Actinomyces odontolyticus              ATCC 17982            DS264586.1
    Bacillus cereus                        ATCC 10987            AE017194.1
    Bacteroides vulgatus                   ATCC 8482             CP000139.1
    Candida albicans*                      SC5314 Assembly 21    N/A
    Clostridium beijerinckii               NCIMB 8052            CP000721.1
    Deinococcus radiodurans                R1 chromosome 1       AE000513.1
    Enterococcus faecalis                  OG1RF chromosome      CP002621.1
    Escherichia coli                       K12                   NC_000913.2
    Helicobacter pylori                    26695                 NC_000915.1
    Lactobacillus gasseri                  ATCC 33323            NC_008530.1
    Methanobrevibacter smithii*            ATCC 35061            NC_009515.1
    Neisseria meningitidis                 MC58                  NC_003112.2
    Propionibacterium acnes                KPA171202             NC_006085.1
    Pseudomonas aeruginosa                 PAO1                  NC_002516.2
    Rhodobacter sphaeroides                2.4.1 chromosome 1    NC_007493.1
    Staphylococcus aureus subsp. aureus    USA300 TCH1516        NC_010079.1
    Staphylococcus epidermidis             ATCC 12228            NC_004461.1
    Streptococcus agalactiae               2603V/R               NC_004116.1
    Streptococcus pneumoniae               TIGR4                 NC_003028.3

  • Xander Validation

    Dataset: HMP defined community data (SRR172902, SRR172903), 1,037 Mbp of 75 bp Illumina reads

    Conclusion: kmer length 45, prune 20 and count 1 work well
       – Count: minimum occurrence of kmers to be included in the graph
       – Prune: stop the search if the score has not improved in # of vertices

    Accuracy measurements:
       1. Number of errors found
       2. Number of chimeric contigs formed
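The two parameters defined above can be illustrated with a small Python sketch. It follows only the slide's definitions; the function names are hypothetical, and the per-vertex scores stand in for the HMM scores that Xander would compute during the search.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every occurrence of each k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_by_count(counts, min_count):
    """Count: keep k-mers seen at least `min_count` times for the graph."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


def search_with_prune(step_scores, prune):
    """Prune: stop once `prune` consecutive vertices fail to improve the score.

    `step_scores` holds the score gained at each vertex along one path.
    Returns (best cumulative score, number of vertices in the best prefix).
    """
    best, best_len, since_best, total = float("-inf"), 0, 0, 0.0
    for i, score in enumerate(step_scores, start=1):
        total += score
        if total > best:
            best, best_len, since_best = total, i, 0
        else:
            since_best += 1
            if since_best >= prune:
                break
    return best, best_len


counts = count_kmers(["ACGTAC", "CGTACG"], k=4)
print(sorted(filter_by_count(counts, min_count=2)))
print(search_with_prune([2, 1, -1, -1, 5], prune=2))
print(search_with_prune([2, 1, -1, -1, 5], prune=3))
```

A small prune value ends the search early and can miss a later gain; a larger value explores further at the cost of more search time.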

  • Comparison to SAT-Assembler

    SAT-Assembler (Zhang Y et al., PLoS Comput. Biology. 2014)
    Target gene: 50S ribosomal subunit protein L2 (rplB) (average length 825 bp)
    Xander was run with the prune 20, kmer 45 and count 1 setting.

    •  Xander recovered full or near full-length (94.6%) of 4 HMP defined members.
    •  SAT only recovered 79.9% of the 3 members. All contigs missed both ends.

    Sample                          HMP (14.5 M reads)     HMP & Corn (24.7 M reads)
                                    SAT      Xander        SAT      Xander
    # contigs                       4        6             *
    # members recovered             3        4             *        9
    Median gene coverage (%)        75.7     94.6          *        100
    Max gene coverage (%)           79.9     100           *        100
    Median % nucleotide identity    97.8     99.8          *        90.3
    Max % nucleotide identity       99.8     100           *        100
    Time (min)                      12 (a)   5 (a)         *        738 (b)

    * SAT did not complete after 100 h.
    (a) on iMac, 3.2 GHz Intel Core i5
    (b) MSU HPCC network drive

  • Biofuel Crops and Nitrogen Cycling Genes

    Crops: Miscanthus, Switchgrass, Corn

    amoA: ammonia monooxygenase
    nifH: nitrogen fixation
    nirK / nirS: nitrite reductase
    norB: nitric oxide reductase
    nosZ: nitrous oxide reductase
    rplB: 50S ribosomal subunit protein L2

  • Rhizosphere Soil Data, Bulk Assembly, nirK

    Sample Name                        Corn    Miscanthus   Switchgrass
    File size (GB)                     349     325          277
    Data size (Gbp)                    293     275          233
    # protein contig clusters (99%)    41      37           39
    # OTUs at 95% aa identity          38      33           34
    Median length (aa)                 131     115          130
    Max length (aa)                    234     252          301
    Median % aa identity               75.6    79.6         73.3
    Max % aa identity                  95.1    94.3         92.1
    # reads covering kmers             105     123          106
    Gene Abundance                     0.25    0.25         0.3

    7 replicates from each crop from the KBS intensive site
    One sample per lane of Illumina HiSeq; replicates were pooled before assembly
    Using the khmer protocol (provided by Jiarong Guo; Howe et al., 2014, PNAS)

  • Rhizosphere Soil Data, Xander Assembly

    C = Corn, M = Miscanthus, S = Switchgrass

    Gene                         nirK                      nifH                    rplB
    Crop                         C       M       S         C      M      S         C        M        S
    # chimeric clusters          16      207     11        0      1      0         14       28       44
    # protein contig clusters    1993    1807    1581      39     57     41        19287    20463    17334
    # OTUs at 95% aa identity    741     674     582       14     24     17        6100     6887     6004
    Median (aa)                  215     230     208       294    256    255       274      274      274
    Longest (aa)                 380     372     370       296    296    296       285      285      284
    Median % aa identity         88.3    84.7    87.8      92.7   91.9   91.6      77.7     75.8     76.3
    Max % aa identity            100     99.4    98.6      100    100    100       100      100      100
    # reads covering kmers       27404   19815   16661     411    534    461       225985   179867   149661
    Gene Abundance               0.121   0.11    0.111     0.002  0.003  0.003


    Use the rplB gene to normalize gene abundance.
    Read ratio: # reads covering kmers in gene contigs / # reads covering kmers in rplB contigs
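As a worked check of the read ratio, the short Python sketch below recomputes the Gene Abundance row from the "# reads covering kmers" row of the Xander assembly table. The dictionary layout and function name are illustrative choices; the counts are copied from the table.

```python
# "# reads covering kmers" per gene and crop, copied from the table.
reads_covering_kmers = {
    "nirK": {"Corn": 27404, "Miscanthus": 19815, "Switchgrass": 16661},
    "nifH": {"Corn": 411, "Miscanthus": 534, "Switchgrass": 461},
    "rplB": {"Corn": 225985, "Miscanthus": 179867, "Switchgrass": 149661},
}


def read_ratio(gene, crop, table=reads_covering_kmers, reference="rplB"):
    """Reads covering kmers in the gene's contigs, divided by the same count for rplB."""
    return table[gene][crop] / table[reference][crop]


for gene in ("nirK", "nifH"):
    for crop in ("Corn", "Miscanthus", "Switchgrass"):
        print(gene, crop, round(read_ratio(gene, crop), 3))
```

Rounded to three decimals, the ratios agree with the Gene Abundance values reported for nirK and nifH.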

  • nirK Kmer Abundance

    [Figure: fraction of kmers (log scale, 1x10-6 to 1x10+0) against kmer abundance (1 to 201) for Corn, Miscanthus and Switchgrass]

    Kmer abundance of nitrite reductase gene (nirK) representative contigs assembled by Xander from the pooled rhizosphere samples. More than 35% of kmers of length 45 in the contigs occurred only once in the reads.

  • Mean Kmer Coverage

    [Figure: two panels (nirK, rplB) showing number of contigs (log scale) against mean kmer coverage for Corn, Miscanthus and Switchgrass]

    Mean kmer coverage of a contig: mean number of reads containing each kmer in a contig. Counts for kmers that occurred in multiple contigs were equally divided. Representative contigs were chosen from clusters at 99% aa identity.
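The coverage definition above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the RDP tool: reads and contigs are treated as same-strand nucleotide strings (no reverse complements), every contig is at least k long, and the function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mean_kmer_coverage(contigs, reads, k):
    """Mean number of reads containing each k-mer of a contig.

    Read counts for a k-mer found in several contigs are divided
    equally among those contigs.
    """
    reads_with_kmer = Counter()
    for read in reads:
        reads_with_kmer.update(set(kmers(read, k)))    # each read counted once per k-mer
    contigs_with_kmer = Counter()
    for seq in contigs.values():
        contigs_with_kmer.update(set(kmers(seq, k)))

    coverage = {}
    for name, seq in contigs.items():
        shares = [reads_with_kmer[km] / contigs_with_kmer[km] for km in kmers(seq, k)]
        coverage[name] = sum(shares) / len(shares)
    return coverage


contigs = {"c1": "ACGTA", "c2": "CGTAC"}
reads = ["ACGTA", "ACGTA", "CGTAC", "CGTA"]
print(mean_kmer_coverage(contigs, reads, k=4))
```

In the toy data the kmer CGTA occurs in both contigs and in four reads, so each contig is credited with two of those reads.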

  • Taxonomic Abundance rplB

    [Figure: taxonomic abundance panels for Xander, rplB and Shotgun, 16S]

    Acidobacteria has few (

  • Taxonomic Abundance nirK

    [Figure: percent abundance (0% to 100%) by taxon for Corn, Miscanthus and Switchgrass. Taxa: Fungi, Thermobaculum, Firmicutes, Spirochaetes, Bacteroidetes, Environmental, Chloroflexi, Verrucomicrobia, Deltaproteobacteria, Gammaproteobacteria, Betaproteobacteria, Alphaproteobacteria]

    15% of the nirK contigs were closest match to rplB from Bradyrhizobium japonicum USDA 110
    The other top matches were: Ralstonia pickettii 12J, Rhodanobacter fulvus Jip2

  • PCA Analysis using OTU abundance at 95% aa identity

    [Figure: two PCA panels, nirK and rplB, with samples labeled C, M and S. Axes in panel order: PC1 8.54% and PC2 5.91%; PC1 6.58% and PC2 5.33%]

  • Multipath to find Sequence Heterogeneity

    Xander can find multiple paths using Yen's k shortest path algorithm
    1 starting kmer, 1000 paths, 37 unique contigs
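For readers unfamiliar with Yen's algorithm, here is a compact Python sketch on a toy weighted graph. The slide does not show how Xander applies the algorithm to its combined graph, so the graph, the edge weights and the function names below are illustrative assumptions only.

```python
import heapq


def dijkstra(graph, source, target, banned_nodes=frozenset(), banned_edges=frozenset()):
    """Cheapest simple path from source to target as (cost, [nodes]), or None."""
    heap = [(0, [source])]
    done = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nxt, weight in graph.get(node, {}).items():
            if nxt in done or nxt in banned_nodes or (node, nxt) in banned_edges:
                continue
            heapq.heappush(heap, (cost + weight, path + [nxt]))
    return None


def path_cost(graph, path):
    return sum(graph[a][b] for a, b in zip(path, path[1:]))


def yen_k_shortest(graph, source, target, k):
    """Up to k cheapest loop-free paths, cheapest first."""
    first = dijkstra(graph, source, target)
    if first is None:
        return []
    accepted, candidates = [first], []
    while len(accepted) < k:
        last = accepted[-1][1]
        for i in range(len(last) - 1):
            root = last[:i + 1]                      # shared prefix; its last node is the spur node
            banned_edges = {(p[i], p[i + 1]) for _, p in accepted if p[:i + 1] == root}
            found = dijkstra(graph, root[-1], target, set(root[:-1]), banned_edges)
            if found is None:
                continue
            spur_cost, spur_path = found
            entry = (path_cost(graph, root) + spur_cost, root + spur_path[1:])
            if entry not in candidates and entry not in accepted:
                heapq.heappush(candidates, entry)
        if not candidates:
            break
        accepted.append(heapq.heappop(candidates))
    return accepted


toy = {
    "C": {"D": 3, "E": 2}, "D": {"F": 4}, "E": {"D": 1, "F": 2, "G": 3},
    "F": {"G": 2, "H": 1}, "G": {"H": 2}, "H": {},
}
for cost, path in yen_k_shortest(toy, "C", "H", k=3):
    print(cost, "-".join(path))
```

Each accepted path spawns deviations at every node along it, which is how a single starting kmer can yield many alternative paths, several of which may collapse to the same contig.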

  • Xander Gene-targeted Assembly Processing Statistics

    Sample Name                 Mock    K312    C1      7 Corns
    Data size (GB)              1.7     7.4     46      349
    Build graph (GB)            1       8       50      200
    Build graph time (h)        0.3     0.4     6.4     41
    Find starting kmers (h) *   0.1     0.5     3.6     27.0
    Search contigs * (unit)     min     min     h       h
       nifH                     0.3     0.1     0.02    0.1
       nirK                     NA      0.7     0.8     36.7
       rplB                     1.1     7       3.8     49.4

    Processing time on MSU HPCC network drive, single CPU
    * can be multithreaded or run in parallel

    1 lane of Illumina HiSeq run in < 20 h

  • Xander Gene Assembly Workflow

    [Workflow diagram with boxes: Modified HMMER3, Xander Build, Xander Search, Quality Filtering, Quality-filtered Genes, Post-Assembly Analysis]

    Post-Assembly Analysis: read mapping, gene coverage, nearest neighbor assignments, taxonomic abundance …

  • Xander Assembly Prep Steps

    1.  Build specialized forward and reverse HMMs
        •  Input: a small set of aligned seed sequences (using original HMMER3 and HMMs from FunGene)
        •  Output: forward and reverse HMMs for Xander, built using our modified HMMER3-mod, which is tuned to detect close homologs

    2.  Identify starting kmers
        •  Input 1: a larger set of reference sequences (covering all possible diversity) aligned by the forward HMMs using HMMER3-mod
        •  Input 2: read files
        •  Output: starting nucleotide kmers, alignment positions, HMM states
        •  Multiple genes can be run together

  • Xander Assembly Steps

    3.  Build de Bruijn graph
        •  Input: read files
        •  Output: de Bruijn graph structure

    4.  Assemble one path for each direction for each start, then combine into one contig
        •  Input 1: forward and reverse HMMs
        •  Input 2: de Bruijn graph
        •  Input 3: starting kmers
        •  Output: nucleotide and protein contigs

    5.  Quality filter
        •  Length cutoff and HMM score cutoff
        •  Cluster at 99%, choose longest contigs (RDP mcClust)
        •  Chimera removal (UCHIME)
        •  Output: quality-filtered contigs
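The first two parts of the quality filter in step 5 can be sketched in Python as below. The record layout, cutoff values and function name are hypothetical; cluster labels are assumed to come from an external clustering step such as mcClust, and chimera removal (UCHIME) is an external tool that is not reproduced here.

```python
def quality_filter(contigs, min_length, min_score):
    """Apply length and HMM score cutoffs, then keep the longest contig per cluster.

    `contigs` is a list of dicts with keys 'id', 'seq', 'score' and 'cluster'.
    """
    passed = [
        c for c in contigs
        if len(c["seq"]) >= min_length and c["score"] >= min_score
    ]
    longest = {}
    for contig in passed:
        best = longest.get(contig["cluster"])
        if best is None or len(contig["seq"]) > len(best["seq"]):
            longest[contig["cluster"]] = contig
    return sorted(longest.values(), key=lambda c: c["id"])


example = [
    {"id": "a", "seq": "M" * 120, "score": 80.0, "cluster": 1},
    {"id": "b", "seq": "M" * 150, "score": 90.0, "cluster": 1},
    {"id": "c", "seq": "M" * 60, "score": 95.0, "cluster": 2},   # fails the length cutoff
    {"id": "d", "seq": "M" * 130, "score": 20.0, "cluster": 3},  # fails the score cutoff
]
kept = quality_filter(example, min_length=100, min_score=50.0)
print([c["id"] for c in kept])
```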

  • Xander Post-Assembly Analysis

    6.  Read mapping (RDP KmerFilter)
        •  Input: quality-filtered contigs
        •  Output: contig coverage, kmer abundance

    7.  Nearest neighbor assignment, taxonomic abundance
        •  Input: quality-filtered contigs
        •  Input: reference seqs
        •  Input: contig coverage
        •  Output: nearest matches
        •  Taxonomic abundance adjusted by coverage

    8.  Beta-diversity analysis (multiple samples)
        •  Input: quality-filtered aligned protein contigs
        •  Input: contig coverage
        •  Output: coverage-adjusted OTU abundance matrix
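One plausible reading of "coverage-adjusted OTU abundance matrix" in step 8 is that each OTU's abundance in a sample is the summed coverage of the contigs assigned to it. The Python sketch below works under that assumption; the input layout, sample and contig names, and function name are hypothetical, and OTU assignments are taken as given.

```python
def otu_abundance_matrix(assignments, coverage):
    """Sum contig coverage per OTU and sample.

    assignments: {sample: {contig: otu}}
    coverage:    {sample: {contig: mean kmer coverage}}
    Returns (sorted OTU names, {sample: {otu: abundance}}).
    """
    otus = sorted({otu for per_sample in assignments.values() for otu in per_sample.values()})
    matrix = {sample: dict.fromkeys(otus, 0.0) for sample in sorted(assignments)}
    for sample, per_sample in assignments.items():
        for contig, otu in per_sample.items():
            matrix[sample][otu] += coverage[sample][contig]
    return otus, matrix


assignments = {
    "Corn": {"c1": "OTU_1", "c2": "OTU_1", "c3": "OTU_2"},
    "Miscanthus": {"m1": "OTU_2"},
}
coverage = {
    "Corn": {"c1": 2.0, "c2": 1.5, "c3": 4.0},
    "Miscanthus": {"m1": 3.0},
}
otus, matrix = otu_abundance_matrix(assignments, coverage)
print(otus)
print(matrix)
```

A matrix of this shape (samples by OTUs) is the usual input for ordination methods such as PCA.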

  • Xander – User Efficient

    •  Set up Xander
       – GitHub repo https://github.com/rdpstaff/Xander_assembler
       – Step-by-step instructions
       – Preconfigured with the rplB gene and nitrogen cycling genes including nirK, nirS, nifH, nosZ, norB and amoA

    •  Prepare the HMMs; this step requires biological insight!
       – Get reference sequences for gene(s) (FunGene, or literature search)
       – Build specialized HMMs for Xander

    •  Get metagenomic data

    •  Run Xander assembly
       – Choose the right parameters for your dataset, see instructions

  • Summary

    •  Compared with a recent gene-targeted assembler and a recent bulk assembly method, Xander assembled more gene contigs that were longer and shared higher % aa identity with known references.

    •  Detects low-abundance genes and low-abundance organisms.

    •  Provides gene abundance and kmer abundance estimates.

    •  HMMs can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines.

    •  Larger kmer size improves quality by reducing chimeras, but may result in shorter contigs.

  • Acknowledgements

    James Cole
    James Tiedje
    Qiong Wang
    Jordan Fish
    Mariah Gilman
    Yanni Sun
    C. Titus Brown
    Jiarong Guo