GENOTYPING-BY-SEQUENCING WHAT IS IT AND WHAT IS IT GOOD FOR?

KEITH R. MERRILL

NCSU – CROP SCIENCE

GBS VS. RAD-SEQ: THE ULTIMATE THROWDOWN! (OF ACRONYMS)

GBS: Genotyping-by-Sequencing

RAD-Seq: Restriction-site associated DNA sequencing

GBS VS. RAD-SEQ: WHAT’S THE DIFFERENCE?

THE CONCEPT

Reduce the Genome

Pool Samples

Sequence Combined Pool

Assign sequences to individuals

Call Variants between individuals

THE CONCEPT

Ind_1, Ind_2, Ind_3, Ind_4, Ind_5, … Ind_n

It’s all about probability

THE CONCEPT

Ind_1, Ind_2, Ind_3, Ind_4, Ind_5, … Ind_n

Reduce the genome and increase the probability of overlap
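The probability argument can be sketched numerically. This is a minimal illustration, not a model of any real library: it assumes reads land uniformly and independently across loci (a Poisson approximation), and the read count and locus counts are hypothetical values chosen only to show the trend.

```python
import math


def p_locus_sampled(reads_per_individual, n_loci):
    """Probability that one locus gets at least one read in one individual.

    Assumes reads fall uniformly at random across loci (Poisson approximation).
    """
    mean_depth = reads_per_individual / n_loci
    return 1.0 - math.exp(-mean_depth)


def p_shared_by_all(reads_per_individual, n_loci, n_individuals):
    """Probability that the same locus is sampled in every individual."""
    return p_locus_sampled(reads_per_individual, n_loci) ** n_individuals


if __name__ == "__main__":
    reads = 1_000_000  # hypothetical reads per individual
    for n_loci in (100_000_000, 10_000_000, 1_000_000, 100_000):
        shared = p_shared_by_all(reads, n_loci, n_individuals=2)
        print(f"{n_loci:>11,} loci: P(shared by 2 individuals) = {shared:.4f}")
```

With the read budget held fixed, shrinking the number of loci raises the chance that two individuals are sequenced at the same place.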

HOW IT WORKS

Ind1, Ind2, Ind3, Ind4, Ind5, … IndN

Tag1, Tag2, Tag3, Tag4, Tag5, … TagN

Tags (AKA Barcodes, MID Barcodes, etc.)

Tag1 = CAGATA

Tag2 = GAAGTG

Tag3 = TAGCGGAT

Tag4 = GGATA

Tag5 = CACCA

TagN = …
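As a rough sketch of how tags let pooled reads be assigned back to individuals: the barcodes below are the ones on the slide, but pairing Tag1 with Ind1, Tag2 with Ind2, and so on is an assumption, and the example reads are made up. Only exact barcode matches are accepted here; real pipelines usually tolerate some mismatches.

```python
from collections import defaultdict

# Barcodes from the slide; the Tag-to-Ind pairing is assumed for illustration.
BARCODES = {
    "CAGATA": "Ind1",
    "GAAGTG": "Ind2",
    "TAGCGGAT": "Ind3",
    "GGATA": "Ind4",
    "CACCA": "Ind5",
}


def demultiplex(reads, barcodes=BARCODES):
    """Group reads by the barcode they start with and strip the barcode."""
    by_sample = defaultdict(list)
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    ordered = sorted(barcodes, key=len, reverse=True)
    for read in reads:
        for tag in ordered:
            if read.startswith(tag):
                by_sample[barcodes[tag]].append(read[len(tag):])
                break
        else:
            # No barcode matched exactly.
            by_sample["unassigned"].append(read)
    return dict(by_sample)


if __name__ == "__main__":
    example_reads = ["CAGATACWGCAAA", "GGATATTT", "NNNN"]  # hypothetical reads
    print(demultiplex(example_reads))
```

Note that the barcodes have different lengths, so the lookup cannot simply slice a fixed number of bases off each read.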

HOW IT WORKS (THE ONE ENZYME METHOD)

Ind1, Ind2, Ind3, Ind4, Ind5, … IndN

Tag1, Tag2, Tag3, Tag4, Tag5, … TagN

HOW IT WORKS (THE TWO ENZYME METHOD)

HOW IT WORKS: SIZE SELECTION

Base-pair range selected
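Size selection can be mimicked in silico to get a feel for how much of the genome a given range keeps. The sketch below is illustrative only: it cuts one strand of a linear sequence at a single recognition site, and the site (GAATTC, cut after the first base), the toy sequence, and the size range are stand-ins rather than values from this talk.

```python
def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site`.

    The cut falls `cut_offset` bases into the site. Returns the fragments
    in order along the sequence.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return [f for f in fragments if f]  # drop empty fragments


def size_select(fragments, min_bp, max_bp):
    """Keep only fragments whose length lies in the selected base-pair range."""
    return [f for f in fragments if min_bp <= len(f) <= max_bp]


if __name__ == "__main__":
    toy_genome = "AAAGAATTCCCCCCGAATTCTT"  # hypothetical sequence
    pieces = digest(toy_genome, site="GAATTC", cut_offset=1)
    print([len(p) for p in pieces])
    print(size_select(pieces, min_bp=6, max_bp=8))
```

Narrowing the range keeps fewer fragments, which is one of the knobs for reducing the genome further.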

HOW IT WORKS: POOLING

Ind1, Ind2, Ind3, Ind4, Ind5, … IndN

Tag1, Tag2, Tag3, Tag4, Tag5, … TagN

Size Selection (optional if using two enzymes)

WHY POOL SAMPLES?

On the Illumina HiSeq 2000:

• 8 lanes of sequencing, each capable of giving 374 million reads.

• You can’t partition a lane.

• Sequencing is expensive ($1500 - $3000 per lane).

• You don’t need/want 374 million reads per individual.
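The arithmetic behind pooling is simple enough to sketch. The reads-per-lane figure and the lane cost range are the ones quoted above; the pool sizes and the assumption that every sample gets an equal share of the lane are hypothetical (real pools are never perfectly even).

```python
READS_PER_LANE = 374_000_000      # figure quoted on the slide
LANE_COST_RANGE = (1500, 3000)    # USD per lane, as quoted on the slide


def per_sample(n_samples):
    """Return (reads, low cost, high cost) per sample for one pooled lane.

    Assumes the lane's reads are split evenly across samples.
    """
    reads = READS_PER_LANE // n_samples
    cost_low = LANE_COST_RANGE[0] / n_samples
    cost_high = LANE_COST_RANGE[1] / n_samples
    return reads, cost_low, cost_high


if __name__ == "__main__":
    for n in (1, 48, 96, 384):  # hypothetical pool sizes
        reads, low, high = per_sample(n)
        print(f"{n:>3} samples/lane: {reads:>11,} reads each, "
              f"${low:,.2f} to ${high:,.2f} per sample")
```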

A WORD ABOUT TAGS

• Hamming vs. Edit Distance

• Sequence errors may result from things other than sequencing.

• n-1 errors are the most common error encountered during oligo synthesis.
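To make the Hamming vs. edit distance point concrete, here is a small sketch. Hamming distance only counts substitutions between equal-length tags, while edit (Levenshtein) distance also counts insertions and deletions, which is what an n-1 synthesis error produces. The shifted tag pair in the example is invented for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs tags of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical pair: the second tag is the first shifted by one base.
    print(hamming("ACGTAC", "CGTACA"))        # far apart by substitutions
    print(edit_distance("ACGTAC", "CGTACA"))  # close once indels are allowed
    # An n-1 error: the first base of a tag from the slide is lost.
    print(edit_distance("CAGATA", "AGATA"))
```

Two tags can look well separated by Hamming distance and still be only a deletion or two apart, so a tag set chosen on Hamming distance alone may not protect against n-1 errors.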

ANALYSIS: IT’S ABOUT TIME… AND MONEY… AND TIME

Key Considerations:
• Time
• Computing power available
• Amount of sequence data (back to time)
• Availability of a reference genome

KEY CONSIDERATIONS

• Study goals
• Availability of a reference genome
• Expected degree of polymorphism
• Choice of restriction enzyme
• DNA sample preparation
• Adaptor design
• PCR amplification
• Sequencing
• Pooling individuals
• Analysis

ANALYSIS: IT’S ABOUT TIME… AND MONEY… AND TIME

A Few Options:

• Stacks
– For use with bi-parental mapping populations
– Takes a lot of time
– Looks at entire reads
– Reference genome optional
– Designed to work nicely with MySQL
– More memory intensive

• UNEAK
– For use with species without a reference genome
– Uses only 64 bp of each read
– MUCH faster than Stacks
– Less memory intensive

• TASSEL
– For use with species with a reference genome
– Uses only 64 bp of each read
– MUCH faster than Stacks
– Less memory intensive

• Custom scripts
– Completely flexible (hence the ‘custom’)
– Requires significant knowledge about programming (or knowing someone who does and is willing to help)

DOES IT WORK? NOTE: THIS IS WITH HEXAPLOID WHEAT AND NO REFERENCE GENOME

THE GOOD

• No ascertainment bias
• Random distribution throughout the genome
• May be useful for species without a reference genome
• Useful with genomic selection
• May provide a large number of SNPs
• Relatively low per sample cost

THE GOOD (CONT)

GBS is extremely flexible:

• Number of individuals per lane/flowcell

• Choice of enzymes
– Cut sites
– Methylation sensitivity

• Size of fragments selected

THE BAD

• Poor reproducibility between runs

• In species without a reference genome, missing data *cannot* be inferred

• Often dealing with large amounts of missing data

• Difficult to filter out false SNPs in non-mapping populations, unless you have a reference genome, and even then…

• In my opinion: this would be nigh impossible to use with association studies in species without a reference genome UNLESS you sequence to very high coverage to virtually eliminate missing data (alternatively, you could drastically reduce the genome by your choice of enzymes – but this may be bad if your expected degree of polymorphism is low)
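One common way to get a handle on the missing-data problem is to measure it per SNP and drop the worst sites. The sketch below assumes a simple genotype table (one row per individual, one column per SNP, `None` for a missing call); the table and the threshold are hypothetical, and this is not how Stacks, UNEAK, or TASSEL do it internally.

```python
def site_missing_rates(genotypes):
    """Fraction of individuals with no call at each SNP.

    `genotypes` has one row per individual and one column per SNP;
    None marks a missing call.
    """
    n_individuals = len(genotypes)
    n_sites = len(genotypes[0])
    return [
        sum(row[j] is None for row in genotypes) / n_individuals
        for j in range(n_sites)
    ]


def keep_sites(genotypes, max_missing):
    """Indices of SNPs whose missing rate is at most `max_missing`."""
    rates = site_missing_rates(genotypes)
    return [j for j, rate in enumerate(rates) if rate <= max_missing]


if __name__ == "__main__":
    # Hypothetical calls for 4 individuals at 3 SNPs.
    calls = [
        ["A", None, "T"],
        ["A", None, None],
        ["G", "C", "T"],
        ["A", None, "T"],
    ]
    print(site_missing_rates(calls))
    print(keep_sites(calls, max_missing=0.25))
```

A strict threshold keeps only well-covered SNPs but can throw away most of the data, which is the trade-off described in the bullets above.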

Questions?