Divide and conquer applied to metagenomic DNA
C. Titus Brown ([email protected])
CSE / MMG, Michigan State University
A brief intro to shotgun assembly

Overlapping fragments:

It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th

Assembled:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

…but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.
the quick brown fox jumped
jumped over the lazy dog
the quick brown fox jumped over the lazy dog
na na na, batman!
my chemical romance: na na na
Repeats do cause problems:
Assemble based on word overlaps:
Whole genome shotgun sequencing & assembly
Randomly fragment & sequence from DNA; reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
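The word-overlap idea above can be sketched as a toy greedy assembler: repeatedly merge the pair of fragments with the longest suffix/prefix overlap. This is only an illustration of the concept, not how any real assembler works:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge fragments greedily by best overlap until none remain."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps above the threshold
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

print(greedy_assemble(["the quick brown fox jumped",
                       "jumped over the lazy dog"]))
# -> ['the quick brown fox jumped over the lazy dog']
```

The all-pairs overlap search here is quadratic in the number of fragments, which is exactly why this approach does not survive contact with 2 bn+ reads.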
How does assembly scale?
• Our assembly approach scales with the amount of genomic novelty present in the sample.
• For “sane” problems (microbes, human genome, etc.) this isn’t too bad, although challenging.
• For metagenomes, with millions of different species at different abundances, this is an intractable problem (so far)…
Great Plains Grand Challenge – Sampling sites

• Wisconsin
  – Native prairie (Goose Pond, Audubon)
  – Long term cultivation (corn)
  – Switchgrass rotation (previously corn)
  – Restored prairie (from 1998)
• Iowa
  – Native prairie (Morris prairie)
  – Long term cultivation (corn)
• Kansas
  – Native prairie (Konza prairie)
  – Long term cultivation (corn)
Iowa Native Prairie
Switchgrass (Wisconsin)
Iowa >100 yr tilled
Sampling strategy per site
Reference soil
Soil cores: 1 inch diameter, 4 inches deep
Total:
8 Reference metagenomes +
64 spatially separated cores (pyrotag sequencing)
[Sampling grid diagram: nested sampling scales of 10 m, 1 m, and 1 cm.]
454 Titanium Shotgun sequencing
Illumina shotgun sequencing
Soil Metagenome
Community composition
454 Titanium Pyrotag sequencing
What kinds of questions?
• What genes are present?
• What species are present?
• What are those species doing, physiologically speaking?
• How does “function” change with cultivation, CO2, fertilizer types, crop cycles, etc.?
We are at a “pre-question” stage, unfortunately…
Great Prairie Sequencing Summary – Illumina whole metagenome shotgun

[Bar chart: base pairs of sequencing (Gbp), 0–350, by platform (GAII vs HiSeq), for Iowa continuous corn, Iowa native prairie, Kansas cultivated corn, Kansas native prairie, Wisconsin continuous corn, Wisconsin native prairie, Wisconsin restored prairie, and Wisconsin switchgrass.]
The basic problem.
• Lots of metagenomic sequence data(200 GB Illumina for < $20k?)
• Assembly, especially metagenome assembly, scales poorly (due to high diversity).
• Standard assembly techniques don’t work well with sequences from multiple abundance genomes.
• Many people don’t have the computational resources needed to assemble (~1 TB of RAM or more), if they can assemble at all.
We can’t just throw more hardware at the problem…
Lincoln Stein
Hat tip to Narayan Desai / ANL
We don’t have enough resources or people to analyze data.
Data generation vs data analysis
It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week.
(Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.)
…x1000 sequencers
Many useful analyses do not scale linearly in RAM or CPU with the amount of data.
The challenge:
Massive (and increasing) data generation capacity, operating at a boutique level, with
algorithms that are wholly incapable of scaling to the data volume.
Note: cloud computing isn’t a solution to a sustained scaling problem!! (See: Moore’s Law slide)
Awesomeness
Easy stuff like Google Search
Life’s too short to tackle the easy problems – come to academia!
Assembly of shotgun sequence

Overlapping fragments:

It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th

Assembled:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

…but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.
the quick brown fox jumped
jumped over the lazy dog
the quick brown fox jumped over the lazy dog
na na na, batman!
my chemical romance: na na na
Repeats do cause problems:
Assemble based on word overlaps:
Whole genome shotgun sequencing & assembly
Randomly fragment & sequence from DNA; reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
K-mer graphs - overlaps
J.R. Miller et al. / Genomics (2010)
K-mer graphs - branching
Biology-based heuristics also come into play for decisions about which paths to follow.
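The k-mer graph construction can be sketched in a few lines: nodes are k-mers, edges connect k-mers that overlap by k-1 characters, and branch points (nodes with more than one successor) are where path decisions arise. A minimal illustration with a made-up pair of reads, not any assembler's actual data structure:

```python
from collections import defaultdict

def kmer_graph(reads, k):
    """Build a k-mer graph: each k-mer points to its (k-1)-overlap successors."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            edges[read[i:i+k]].add(read[i+1:i+k+1])
    return edges

# Two reads that agree on a prefix, then diverge -> a branch in the graph.
g = kmer_graph(["ATGCGT", "ATGCAT"], k=3)
branches = {node: succs for node, succs in g.items() if len(succs) > 1}
print(branches)
# "TGC" has two successors (GCG and GCA): the assembler must choose a path here.
```

Real de Bruijn assemblers additionally track coverage per edge and use it (plus the biology-based heuristics mentioned above) to decide which branch to follow.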
Great Prairie Sequencing Summary – Illumina whole metagenome shotgun

[Bar chart: base pairs of sequencing (Gbp), 0–350, by platform (GAII vs HiSeq), for Iowa continuous corn, Iowa native prairie, Kansas cultivated corn, Kansas native prairie, Wisconsin continuous corn, Wisconsin native prairie, Wisconsin restored prairie, and Wisconsin switchgrass.]
Billions and billions of …

>850:2:1:1943:15232/1 0
CCTGCCTGTGGAGCAGCCCACGCAGTTCGAGCTGATCATCAACCTCAAGACGGCCCAAGCCCTTGGCATCACGATT
>850:2:1:1943:15232/2 0
ACACCATTTAATCTTAGCCATAAAAGTTGTATAAGCATCAACGTTTTGTTTGTCTCAAAAAACGATTTTTTTTTTG
>850:2:1:1943:19543/1 0
ACTGTAGGTTTCTGGCTGCGTCCGACGATAGCAGCCCGCTCTGCCGACATTGTCA
>850:2:1:1945:16822/2 0
AGTCGACAGATCGACCTGAAGGAGGTGCCGGGAATTGAAGTCATCCAGGGCGCCGAGGAGAACTGATCGG
>850:2:1:1946:10202/2 0
AGCTTTTTCGCGCGCGTGAAAAAGCTTTGTCGATTTCTGGGTTTCGGCCTTCTCACAGTCACCGCCGAGGGCCGGG
>850:2:1:1947:6533/2 0
GGTCTCCGGACACACGAAGGCACGGCTCTCCGAGAAGCGGAGGATGTACTCGACCTCACGGCTGC
>850:2:1:1948:15431/1 0
ACCGCTTACTCGATGATGGAGCAAGGCAGAATCGACATGATTCTGAGCTCGCGTCCCGAAGATCGACGCGCGG
>850:2:1:1949:19998/1 0
AATTCAAAGTAGGCATTTTTGTTTTTGTAGGGTTGGCGATGTTAGGCGCGCTGGTCGTGCAATTC
>850:2:1:1950:4213/2 0
CCAACCGGGCCCTGGTCCTGCACGCCAACCTGTCCCCGCTGGTGG
>850:2:1:1950:1388/1 0
CAGCCGCAATGTTGGCATTCTTCAGCAGTTCGAGCGCCACAAAGCGGTCATTGTCTGAGGCTTCTGGG
Too much data – what can we do?
• Reduce the size of the data (either with an approximate or an exact approach)
• Divide & conquer: subdivide the problem.
• For exact data reduction or subdivision, need to grok the entire assembly graph structure.
• …but that is why assembly scales poorly in the first place.
Two exact data reduction techniques:
• Eliminate reads that do not connect to many other reads.
• Group reads by connectivity into different partitions of the entire graph.
For k-mer graph assemblers like Velvet and ABYSS, these are exact solutions.
Eliminating unconnected reads
“Graphsize filtering”
Subdividing reads by connection
“Partitioning”
Two exact data reduction techniques:
• Eliminate reads that do not connect to many other reads (“graphsize filtering”).
• Group reads by connectivity into different partitions of the entire graph (“partitioning”).
For k-mer graph assemblers like Velvet and ABYSS, these are exact solutions.
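The partitioning idea — grouping reads by graph connectivity — can be sketched with union-find over reads, joining any two reads that share a k-mer. This shows only the concept; the actual khmer implementation traverses a Bloom-filter graph rather than hashing reads directly:

```python
def partition(reads, k):
    """Group reads into connected components: reads sharing any k-mer
    belong to the same component and can be assembled independently."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    seen = {}  # k-mer -> first read id that contained it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i+k]
            if kmer in seen:
                union(rid, seen[kmer])
            else:
                seen[kmer] = rid

    groups = {}
    for rid in range(len(reads)):
        groups.setdefault(find(rid), []).append(rid)
    return list(groups.values())

reads = ["ATGGCGTA", "GCGTACCA",   # share the k-mer GCGT -> one partition
         "TTTTAACC"]               # shares nothing -> its own partition
print(partition(reads, 4))
```

Each resulting partition can then be handed to Velvet or ABYSS on its own, which is what makes the result exact: no overlap ever crosses a partition boundary.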
Engineering overview
• Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure;
• With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k.
• Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes).
For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784
Store graph nodes in Bloom filter
Graph traversal is done in full k-mer space;
Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
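A minimal sketch of that storage scheme: keep each k-mer's presence in a Bloom filter (a bit array plus a few hash functions, with no collision tracking), so membership queries can give false positives but never false negatives. The class name and parameters here are illustrative, not khmer's actual API:

```python
import hashlib

class KmerBloom:
    """Toy Bloom filter for k-mer presence/absence (no counts, no removal)."""

    def __init__(self, n_bits=2**20, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, kmer):
        # Derive n_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.n_hashes):
            h = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, kmer):
        # All bits set -> probably present; any bit clear -> definitely absent.
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(kmer))

bf = KmerBloom()
bf.add("GTCGTAGTTCAGTTGGTTAGAACGCCGGCCTG")
```

Because only bits are stored (no k-mer strings, no collision lists), memory stays at a few bits per k-mer regardless of k, which is the property that makes ~1-2 bytes per unique k-mer achievable.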
Practical application
• Enables:– graph trimming (exact removal)– partitioning (exact subdivision)– abundance filtering
• … all for K <= 64, for 200+ gb sequence collections.
• All results (except for comparison) obtained using a single Amazon EC2 4xlarge node, 68 GB of RAM / 8 cores.
• Similar running times to using Velvet alone.
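The abundance-filtering step listed above can be illustrated with exact k-mer counting (the real system uses probabilistic counts in the Bloom-filter structure); the function name and the median-abundance cutoff here are illustrative choices, not the khmer API:

```python
from collections import Counter
from statistics import median

def abundance_filter(reads, k, min_abund=2):
    """Drop reads whose median k-mer abundance is below min_abund
    (e.g. likely error-dominated singleton reads)."""
    counts = Counter(r[i:i+k]
                     for r in reads
                     for i in range(len(r) - k + 1))
    return [r for r in reads
            if median(counts[r[i:i+k]]
                      for i in range(len(r) - k + 1)) >= min_abund]

# The duplicated read survives; the singleton is dropped.
print(abundance_filter(["ATGGCA", "ATGGCA", "ACGTAC"], k=3))
```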
We pre-filter data for assembly:
Does removing small graphs work?
Small data set (35m reads / 3.4 gb rhizosphere soil sample)
Filtered at k=32, assembled at k=33 with ABYSS
                      N contigs   Total bp   Largest contig
Unfiltered (35m)      130         223,341    61,766
Filtered (2m reads)   130         223,341    61,766
YES.
Does partitioning into disconnected graphs work?
Partitioned same data set (35m reads / 3.5 gb) into 45k partitions containing > 10 reads; assembled partitions separately (k0=32, k=33).
                      N contigs   Total bp   Largest contig
Unfiltered (35m)      130         223,341    61,766
Sum partitions        130         223,341    61,766
YES.
Data reduction for assembly / practical details
Reduction performed on machine with 16 gb of RAM.

Removing poorly connected reads: 35m -> 2m reads.
- Memory required reduced from 40 gb to 2 gb;
- Time reduced from 4 hrs to 20 minutes.

Partitioning reads into disconnected groups:
- Biggest group is 300k reads
- Memory required reduced from 40 gb to 500 mb;
- Time reduced from 4 hrs to < 5 minutes/group.
Does it work on bigger data sets?
35 m read data set partition sizes:
P1: 277,043 reads
P2: 5776 reads
P3: 4444 reads
P4: 3513 reads
P5: 2528 reads
P6: 2397 reads
…
Iowa continuous corn GA2 partitions (218.5 m reads):
P1: 204,582,365 reads
P2: 3583 reads
P3: 2917 reads
P4: 2463 reads
P5: 2435 reads
P6: 2316 reads
…
Problem: big data sets have one big partition!?
• Too big to handle on EC2.
• Assembles with low coverage.
• Contains 2.5 bn unique k-mers (~500 microbial genomes), at ~3-5x coverage
• As we sequence more deeply, the “lump” becomes a bigger percentage of reads => trouble!
  – Both for our approach,
  – And possibly for assembly in general (because it assembles more poorly than it should, for given coverage/size)
Why this lump?
1. Real biological connectivity (rRNA, conserved genes, etc.)
2. Bug in our software
3. Sequencing artifact or error
Why this lump?
1. Real biological connectivity? Probably not.
   - Increasing K from 32 to ~64 didn’t break up the lump: not biological.
2. Bug in our software? Probably not.
   - We have a second, completely separate approach & implementation that confirmed the lump (bleu, by Rosangela Canino-Koning).
3. Sequencing artifact or error? YES.
   - (Note, we do filter & quality trim all sequences already.)
“Good” vs “bad” assembly graph
Low density
High density
Non-biological levels of local graph connectivity:
Higher local graph density correlates with position in read
ARTIFACT
Trimming reads

• Trim at high “soddd”, sum of degree of degree distribution:
  – From each k-mer in each read, walk two k-mers in all directions in the graph;
  – If more than 3 k-mers can be found at exactly two steps, trim the remainder of the sequence.

Overly stringent; actually trimming the (k-1) connectivity graph by degree.
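A hypothetical sketch of that trimming rule, with the graph stood in for by a plain dict of k-mer -> neighbors (the real code walks the Bloom-filter graph, and the choice to keep the offending k-mer itself is an assumption here):

```python
def two_step_count(kmer, neighbors):
    """Count distinct k-mers reachable in exactly two graph steps."""
    reached = set()
    for n1 in neighbors.get(kmer, ()):
        for n2 in neighbors.get(n1, ()):
            reached.add(n2)
    return len(reached)

def density_trim(read, k, neighbors, max_soddd=3):
    """Truncate a read at the first k-mer whose two-step neighborhood
    exceeds max_soddd (a non-biological density level)."""
    for i in range(len(read) - k + 1):
        if two_step_count(read[i:i+k], neighbors) > max_soddd:
            return read[:i + k]  # keep up to the dense k-mer, drop the rest
    return read

# Toy adjacency: "AAA" fans out to 4 distinct k-mers within two steps,
# so a read passing through it gets trimmed there.
toy = {"AAA": ["AAC", "AAG"], "AAC": ["ACA", "ACC"], "AAG": ["AGA", "AGG"]}
print(density_trim("TAAACC", 3, toy))
# -> "TAAA"
```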
Trimmed read examples
>895:5:1:1986:16019/2
TGAGCACTACCTGCGGGCCGGGGACCGGGTCAGCCTGCTCGACCTGGGCCAACCGATGCGCC
>895:5:1:1995:6913/1
TTGCGCGCCATGAAGCGGTTAACGCGCTCGGTCCATAGCGCGATG
>895:5:1:1995:6913/2
GTTCATCGCGCTATGGACCGAGCGCGTTAACCGCTTCATGGCGCGCAAAGATCGGAAGAGCGTCGTGTAG
Preferential attachment due to bias
• Any sufficiently large collection of connected reads will have one or more reads containing an artifact;
• These artifacts will then connect that group of reads to all other groups possessing artifacts;
• …and all high-coverage contigs will amalgamate into a single graph.
Artifacts from sequencing falsely connect graphs
Groxel view of knot-like region / Arend Hintze
Density trimming breaks up the lump:
Old P1, soddd trimmed (204.6 m reads -> 179 m):

P1: 23,444,332 reads
P2: 60,703 reads
P3: 48,818 reads
P4: 39,755 reads
P5: 34,902 reads
P6: 33,284 reads
…
Untrimmed partitioning (218.5 m reads):

P1: 204,582,365 reads
P2: 3583 reads
P3: 2917 reads
P4: 2463 reads
P5: 2435 reads
P6: 2316 reads
…
What does density trimming do to assembly?
204 m reads in lump: assembles into 52,610 contigs; total 73.5 MB
180 m reads in trimmed lump: assembles into 57,135 contigs; total 83.6 MB
(all contigs > 1kb)
Filtered/partitioned @k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
Wait, what?
• Yes, trimming these “knot-like” sequences improves the overall assembly!
• We remove 25.6 m reads and gain 10.1 MB!?
• Trend is same for ABySS, another k-mer graph assembler, as well.
So what’s going on?
• Current assemblers are bad at dealing with certain graph structures (“knots”).
• If we can untangle knots for them, that’s good, maybe?
• Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves?
• Happens with other k-mer graph assemblers (ABYSS), and with at least one other (non-metagenomic) data set.
OK, let’s assemble!
Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to:
148,053 contigs, in 220 MB;
max length 20,322;
max coverage ~10x
…all done on Amazon EC2, ~ 1 week for under $500.
Filtered/partitioned @k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
Full Iowa corn / mapping stats
• 1,806,800,000 QC/trimmed reads (1.8 bn)
• 204,900,000 reads map to some contig (11%)
• 37,244,000 reads map to contigs > 1kb (2.1%)
> 1 kb contig is a stringent criterion!
Compare: 80% of MetaHIT reads map to contigs > 500 bp;
65%+ of rumen reads map to contigs > 1 kb.
Success, tentatively.
We are still evaluating assembly and assembly parameters; should be possible to improve in every way.
(~10 hrs to redo entire assembly, once partitioned.)
The main engineering point is that we can actually run this entire pipeline on a relatively small machine
(8 core/68 GB RAM)
We can do dozens of these in parallel on Amazon rental hardware.
And, from our preliminary results, we get ~ equivalent assembly results as if we were scaling our hardware.
Conclusions
• Engineering: can assemble large data sets.
• Scaling: can assemble on rented machines.
• Science: can optimize assembly for individual partitions.
• Science: retain low-abundance sequences.
Caveats
Quality of assembly??
• Illumina sequencing bias/error issue needs to be explored.
• Scaffolding with Velvet causes systematic problems
• Regardless of Illumina-specific issue, it’s good to have tools/approaches to look at structure of large graphs.
Future thoughts
• Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers. So it is a good first step to try, even if it doesn’t reduce the problem significantly.
• Divide & conquer approach should allow more sophisticated (compute intensive) graph analysis approaches in the future.
• This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence.
• Can k-mer filtering work for non-de Bruijn graph assemblers? (SGA, ALLPATHS-LG, …)
• mRNAseq and genome artifact filtering?
Kmer -> GTCGTAGTTCAGTTGGTTAGAACGCCGGCCTG

747:3:13:7042:16004/1  GATATCTGCAATATCCCGTTCGAATGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGGAGGCC
747:3:14:10559:9771/1  GAAATTCCGGTTTGATGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCG
747:3:14:17232:4498/1  CAAATTTGAGATCTGAGATCCCAGGGGTTTGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGG
747:3:15:7871:10206/1  TTTGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCGAGTCCCGTCGG
747:3:16:17865:15895/2 TCAGGAGACGCCAGGGCGGTCTGAGTTCTTCAGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGG
747:3:27:9549:13966/1  GGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCGAGTCCCGTCGGCTCCGCC
747:3:30:10672:3136/1  GCGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGGAGGTCGCGAGTTCGAGTCTCGTCGGCCC
Better artifact filtering?
All paths lead to the same k-mers
[Histogram of k-mer traversal counts: x-axis, number of times a k-mer is traversed (0–25); y-axis, number of k-mers (0–90,000).]
Estimating sequencing return on investment
• To reach ~rumen depth of sampling of top abundance organisms, would need ~1-2 TB
5x Sequencing Coverage (931 GB)
10x Sequencing Coverage (1900 GB)
<1% Novel Sequence
Argonne National Laboratory Institute for Genomic and Systems Biology
Earth Microbiome Project
www.earthmicrobiome.org
• Goal – to systematically approach the problem of characterizing microbial life on earth
• Paradigm shift to analyzing communities from a microbe’s perspective:
• Strategy:
  – Explore microbes in environmental parameter space
  – Design ‘ideal’ strategy to interrogate these biomes
  – Acquire samples and sequence broad and deep both DNA, mRNA and rRNA
  – Define microbial community structure and the protein universe
• Gilbert et al., 2010a,b Standards in Genomic Science, open access
• Challenges
  – 2.4 Quadrillion Base Pairs (2.4 Petabases) = 8000 HiSeq 2000 runs.
– Global Environmental Sample Database (GESI): identification and selection of 200,000 environmental samples, soil, air, marine and freshwater, host-associated, etc.
– The standardization of sampling, sample prep and sample processing, cataloging and sample metadata – Genomic Standards Consortium can help!
– The coordination of thousands of “volunteer” scientists for site characterization, sample collecting and processing
Acknowledgements:

The k-mer gang:
• Adina Howe
• Jason Pell
• Rosangela Canino-Koning
• Qingpeng Zhang
• Arend Hintze
Collaborators:
• Jim Tiedje (“the godfather”)
• Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI)
• Charles Ofria (MSU)
Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.