Today

• Please read…

Science 291: 1304-1351

Human Genome Project Dissenters

My Brush with Greatness?

• 1992: Two years into the HGP, two of the project’s biggest critics were…

– Sydney Brenner: believed that the HGP should focus on human EST collections and on sequencing the genome of a simple vertebrate (Fugu).

– Craig Venter: believed that the clone-by-clone approach was not the most efficient way to proceed, and suggested that shotgun approaches, even a whole-genome approach, were feasible.

…they were both right.

Sydney Brenner

2002 Nobel Prize (Medicine/Physiology)

Sydney Brenner and John E. Sulston, Britain

H. Robert Horvitz, United States

– for discoveries concerning how genes regulate organ development and a process of programmed cell death.

End-sequenced cDNAs (complementary DNA)

Expressed Sequence Tags (ESTs)

cDNA: synthetic DNA copied from an mRNA template,

– through the action of an RNA-dependent DNA polymerase called reverse transcriptase.

Online Primer: est.html

Brenner was right….

Still Sequencing cDNAs,

- first and easiest look into any genome,

- useful in understanding genomic sequence (gene finding),

- helps determine splice site variants,

- shorter than genomic clones, fits in plasmids,

- etc.

…tissue specific ESTs are very useful.

Used for microarrays…

…an array of DNA that can be hybridized with probes to study patterns of gene expression.

Whole Genome Assembly

• 1995: 1.8 Mbp Haemophilus influenzae genome sequenced,

• 1996 on: Mycoplasma, E. coli and others*,

• 1999: Chromosome 2 of Arabidopsis,

• 2000: Drosophila (120 Mbp) genome,

…Human, Mosquito, etc…

• Lots of genomes, several applications...

*WGA of bacterial, viral populations...

Venter was right….

J. Craig Venter

• 1 year, 120 megabases,

• Assembly algorithms could generate accurate genomic sequences,

• Interim assemblies (or mapping) were not necessary.

24 MARCH 2000 VOL 287 SCIENCE

Big Biology

Think About This…

…the plasmid library construction is the first critical step in WGA sequencing,

– “if the DNA libraries are not uniform in size, non-chimeric, and do not randomly represent the genome, then the subsequent steps cannot accurately reconstruct the genome sequence.”

– “We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence).”

Whose DNA?

• 21 enrolled donors,

– age, sex, ethnographic group,

– one African-American,

– one Asian-Chinese,

– one Hispanic-Mexican,

– two Caucasians*.

Whose, Mostly?

J. Craig Venter

8 September 1999 - 25 June 2000

543 bp average sequence read

…back to humans…

What to know?

• Individuals,

• Libraries,

• Sequence coverage,

• Clone coverage,

• Other?

WGA Outline

Online Primer: snps.html

[Figure: mate-pair cartoons. DNA in sized libraries: vector, insert, sequencing primers. DNA sequence in mate-pairs: two sequenced ends (~543 bp each, e.g. 5’- actgtacgtgtagctgaca… - 3’ and 5’- tagcgtagttattttgc… - 3’) flanking an unsequenced insert of approximately known size.]

Whole Genome Assembly

What does Shredder do? Why?

1. Screener

2. Overlapper

3. Unitigger/Discriminator,

4. Scaffolder,

5. Repeat Resolver.

Screener

...finds and “masks” microsatellite repeats, known repeated regions and ribosomal DNA, etc.

– “masked” regions not used to make contigs,

– “marks” the rest for overlapping.

read:

atgacttacttactgcatatttatttatttatttatttatttatttatttatttatttatttatttatttatttatttgacgtgtacgtgtacgtgtagctgtacgtgtacgtgacgggccgcattatcgtgatgctacgtgtacgttatatctgatcgtgcatgtga

masked:

atgacttacttactgcatatttatttatttatttatttatttatttatttatttatttatttatttatttatttatttgacgtgtacgtgtacgtgtagctgtacgtgtacgtgacgggccgcattatcgtgatgctacgtgtacgttatatctgatcgtgcatgtga

marked:

atgacttacttactgcatatttatttatttatttatttatttatttatttatttatttatttatttatttatttatttgacgtgtacgtgtacgtgtagctgtacgtgtacgtgacgggccgcattatcgtgatgctacgtgtacgttatatctgatcgtgcatgtga
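As a rough illustration of what a Screener-style masking pass might look like, the sketch below replaces short tandem repeats with "n" so that they would be ignored when looking for overlaps. It is not the assembler's actual code: the function name, the thresholds (`max_unit`, `min_copies`) and the shortened example read are all assumptions made for this example, and it handles only exact microsatellite repeats, not known repeat families or ribosomal DNA.

```python
import re


def mask_tandem_repeats(read, max_unit=6, min_copies=5):
    """Replace short exact tandem repeats with 'n' characters.

    max_unit and min_copies are illustrative thresholds, not values
    taken from the slides.
    """
    # A unit of 1..max_unit bases (shortest unit tried first) followed by
    # at least (min_copies - 1) further exact copies of that same unit.
    pattern = re.compile(r"([acgt]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda m: "n" * len(m.group(0)), read)


# Hypothetical read: unique sequence flanking a (ttta)n microsatellite,
# loosely modelled on the example read shown on the slide.
read = "atgacttactgca" + "ttta" * 12 + "gacgtgtagctg"
masked = mask_tandem_repeats(read)
print(read)
print(masked)
```

Masking with "n" keeps the read length unchanged, so coordinates in the masked read still line up with the original read.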

Overlapper

...looks for end-to-end overlaps of at least 40 bp with no more than 6% differences in match.

What’s the significance? ...a one in 10^17 event, given perfect randomness.

<--tactgtacgtagctgtgatgttcctcggatatagcgggcatatttattacgctattgtacgtgt-3’

5’- gttcctcggatatagcgggcatatttattacgctattgtacgtgtaaagtatcgt-->

≥ 40 bp, ≤ 6% mismatch
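The overlap criterion (at least 40 bp, no more than 6% differences) can be sketched in a few lines. This is a simplified illustration, not the Overlapper itself: the function name is hypothetical, it only compares the end of one read with the start of another on the same strand, and it counts substitutions only (a real overlapper would also allow insertions and deletions and check reverse complements). The two reads are the ones drawn on the slide.

```python
def best_end_overlap(left, right, min_len=40, max_mismatch_rate=0.06):
    """Longest suffix of `left` that matches a prefix of `right`.

    Returns (overlap_length, mismatches), or None if no overlap of at
    least min_len bases stays within the mismatch budget.
    Substitutions only; indels are not considered in this sketch.
    """
    for length in range(min(len(left), len(right)), min_len - 1, -1):
        suffix = left[-length:]
        prefix = right[:length]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch_rate * length:
            return length, mismatches
    return None


# The two reads from the slide (left read ends where the right read begins).
left = "tactgtacgtagctgtgatgttcctcggatatagcgggcatatttattacgctattgtacgtgt"
right = "gttcctcggatatagcgggcatatttattacgctattgtacgtgtaaagtatcgt"

overlap = best_end_overlap(left, right)
print(overlap)
```

Trying the longest candidate first means the function reports the largest overlap that satisfies both thresholds, which is the behaviour one would usually want before building contigs.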

Good News

... uniquely assembled contigs (unitigs) are readily identifiable,

– all of the assembled sequences match over all of the known sequence,

- and -

...are consistent with an 8x sequence coverage.

Unitigs

...contig cluster is consistent with expected size (+8),

...no dissimilar sequences between any members.

But(t):

...the Screener doesn’t include all of the “low frequency” level repeats,

...so, a majority of the Overlapper outputs turned out to be bogus.

What Now?

– “over-collapsed” assemblies are identified and broken down into unitigs when possible...

– …these “too-large” contig sets are sent to the Unitigger/Discriminator.

...over-collapsed.

...in a world where real data matches expected data, each locus would have 8X coverage,

...if there are genomic repeats, then sequences would be “over-represented”, on average, 8 more per repeat, per contig.
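The coverage argument can be made concrete with a small calculation. The sketch below is a deliberate simplification and not the assembler's actual discriminator statistic: it just divides the observed read depth of a contig by the expected 8X depth to estimate how many genomic copies were collapsed into it. The function names, the 1.5 threshold and the example read counts are assumptions chosen for illustration.

```python
def estimated_copy_number(read_count, read_len, contig_len, expected_coverage=8.0):
    """Observed read depth of a contig divided by the expected depth."""
    observed_depth = read_count * read_len / contig_len
    return observed_depth / expected_coverage


def looks_over_collapsed(read_count, read_len, contig_len,
                         expected_coverage=8.0, threshold=1.5):
    """Flag contigs whose depth suggests more than one locus was merged.

    The 1.5 threshold is a hypothetical cut-off for this sketch.
    """
    copies = estimated_copy_number(read_count, read_len, contig_len,
                                   expected_coverage)
    return copies >= threshold


# Hypothetical 10 kb contigs built from 543 bp reads.
unique_contig = estimated_copy_number(147, 543, 10_000)   # close to 1 copy
repeat_contig = estimated_copy_number(295, 543, 10_000)   # close to 2 copies
print(round(unique_contig, 2), round(repeat_contig, 2))
print(looks_over_collapsed(147, 543, 10_000), looks_over_collapsed(295, 543, 10_000))
```

A contig at roughly twice the expected depth is a candidate two-copy repeat that was assembled as if it were a single locus.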

Unitigger

...differentiates between a true overlap and an overlap that includes more than one locus.

Discriminator

...parses the “over-collapsed” contig by using sequence outside of the overlap region

Discriminator

...may yield u-unitigs.

Unitigger/Discriminator Output: correctly assembled contigs covering 73.6% of the genome.

Scaffolder

...contigs the contigs,

– uses mate-pair information; two or more consistent mate-pair matches yield 1 in 10^10 odds of being chance.
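The "two or more consistent mate-pairs" rule can be illustrated with a minimal sketch. It assumes each read has already been placed in a contig and simply counts how many mate-pairs bridge each pair of contigs, keeping only links with at least two supporting pairs. Read names, contig names and the function name are hypothetical, and the sketch ignores the orientation and insert-size checks that a real scaffolder would need in order to call a link "consistent".

```python
from collections import Counter


def scaffold_links(mate_pairs, read_to_contig, min_links=2):
    """Count mate-pairs that bridge two different contigs.

    Returns {(contig_a, contig_b): link_count} for contig pairs supported
    by at least min_links mate-pairs. Orientation and insert size are
    not checked in this sketch.
    """
    counts = Counter()
    for left_read, right_read in mate_pairs:
        a = read_to_contig.get(left_read)
        b = read_to_contig.get(right_read)
        if a is None or b is None or a == b:
            continue  # unplaced read, or both mates in the same contig
        counts[tuple(sorted((a, b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}


# Hypothetical placements of reads in contigs.
read_to_contig = {
    "r1": "c1", "r2": "c2",
    "r3": "c1", "r4": "c2",
    "r5": "c2", "r6": "c3",
    "r7": "c1", "r8": "c1",
}
mate_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")]

links = scaffold_links(mate_pairs, read_to_contig)
print(links)
```

Here c1 and c2 are joined because two independent mate-pairs agree, while the single pair between c2 and c3 is not treated as sufficient evidence.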

Repeat Resolver

...most of the remaining gaps were due to repeats.

“Rocks”

Use “low Discriminator Value” contig sets to fill gaps,

- find two or more mate pairs with unambiguous matches in the scaffold near the gap (2 kb, 10 kb or 50 kb) (1 in 10^7),

“Stones”

- find mate pair matches 2 kb, 10 kb, and 50 kb from gap, place the mate in the gap, check to see if it’s consistent with other “placed” sequences.

confirm matches

If that Doesn’t Work

...find a mate-pair that spans the gap, and sequence it,

Sequence Walking

...make sequencing primer from BES...

Wednesday

• Questions about WGA,

• CSA,

• Comparisons,

• Quality Control, etc.
