International Sheep Genomics Consortia www.sheephapmap.org Assembling the sheep genome via KAREN John McEwan (AgResearch Invermay) on behalf of the International Sheep Genomics Consortium


International Sheep Genomics Consortia

www.sheephapmap.org

Assembling the sheep genome via KAREN

John McEwan (AgResearch Invermay) on behalf of the

International Sheep Genomics Consortium


Why?

• We want to improve genetic gain in sheep
  – Can use whole genome selection
• Need a high density SNP chip
  – Need a genome sequence and SNPs
    » Need to sequence and assemble sheep
• Use new sequencing technology
• Job too big and expensive for one group
• International Consortia developed


Whole genome selection

• Major scientific advance
• Genome sequencing & SNP chips = “genome wide selection”
• As accurate as progeny testing,
  – but can be done at birth
• Suitable for
  – sex limited,
  – difficult to measure traits or
  – traits measured late in life
• Dairy cattle:
  – increase genetic gain 50-100%
  – while decreasing progeny testing costs
• Application in sheep is still being explored but has great advantages
• Numerous other species and uses for sequence


What is a SNP and what is a SNP chip?

• SNP = single nucleotide polymorphism

• SNP chip = test 60,000 to 1,000,000 SNPs
• WGS works by being able to:
  – predict status of other SNP variants nearby
  – includes variants that affect production traits

MELD            atcgcgtgtagctagtgctagctgctagctagctgatgca
ROM1_read12667  .............t..........................
AWA1_read00345  ........................................
SBF1_read06734  ........................................
TEX1_read00234  .............t..........................
ROM1_read10385  .............t..........................
TEX1_read39890  ........................................
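The alignment above can be read mechanically. The sketch below is a minimal illustration, assuming that "." means "same base as the MELD guide row" and that any other character is a variant base; the function name `find_variants` is hypothetical and is not part of the ISGC pipeline.

```python
# Guide and read rows follow the slide. Treating "." as "matches the guide"
# is an assumption about the notation, not a documented format.
GUIDE = "atcgcgtgtagctagtgctagctgctagctagctgatgca"
MATCH = "." * len(GUIDE)
VARIANT = "." * 13 + "t" + "." * 26  # a single 't' at 0-based column 13

READS = [
    ("ROM1_read12667", VARIANT),
    ("AWA1_read00345", MATCH),
    ("SBF1_read06734", MATCH),
    ("TEX1_read00234", VARIANT),
    ("ROM1_read10385", VARIANT),
    ("TEX1_read39890", MATCH),
]


def find_variants(guide, reads):
    """Return {column: {variant_base: [read names]}} for columns that differ."""
    variants = {}
    for name, row in reads:
        if len(row) != len(guide):
            raise ValueError(f"{name}: row length does not match guide")
        for pos, base in enumerate(row):
            if base != "." and base != guide[pos]:
                variants.setdefault(pos, {}).setdefault(base, []).append(name)
    return variants


variants = find_variants(GUIDE, READS)
for pos, alleles in sorted(variants.items()):
    print(pos, GUIDE[pos], alleles)
```

Here the only variable column carries the guide base in reads from three animals (AWA1, SBF1, TEX1) and the alternative base in reads from two (ROM1, TEX1).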


ISGC – division of labor

• 6 sites
• AgR hosts core database
• tasks divided to best utilise skills
• best utilise resources
• history was DVDs…..
• versioning
• KAREN for transfer of data
• make available to world


Roche 454 FLX Skim Sequencing Strategy

• Otago University: Romney, Texel, Scottish Blackface (0.5x each)
• Baylor College HGSC: Merino, Poll Dorset, Awassi (0.5x each)
• Pipeline
  – Repeat mask
  – Blast BT4.0 + BT2 addns
  – Assemble with Newbler
  – Meld against bovine scaffold
  – Reorder via ovine BES
  – Detect SNPs


dB Summary

Breed               Source   SeqCount     AvgLength   BaseCount
Awassi              Baylor   7113075      235         1,675,721,394
Merino              Baylor   9004167      220         1,995,873,002
Poll Dorset         Baylor   7917802      238         1,890,589,115
Romney              Otago    6008805      219         1,330,683,710
Scottish Blackface  Otago    5611006      229         1,273,929,900
Texel               Otago    6735328      227         1,529,979,986
TOTAL                        41,000,192               9,696,777,107


Data

• ~90 “runs” on 454
• Per run
  – Sequence ~130Mb
  – Processed data ~800Mb (quality ….)
  – Raw data 33Mb x 412 images = 13.6Gb
• Total 1224 + 72 = 1296Gb
• Actually more as used another technology
• This gives a false impression…..
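The totals on this slide can be reproduced with a few lines of arithmetic. This is only a check of the quoted figures, assuming decimal units (1 Gb = 1000 Mb), exactly 90 runs, and that 1224 is 90 times the rounded per-run raw figure and 72 is 90 times the ~800Mb of processed data; the variable names are illustrative.

```python
RUNS = 90                    # "~90 runs" taken as exactly 90 (assumption)
RAW_MB_PER_IMAGE = 33
IMAGES_PER_RUN = 412
PROCESSED_MB_PER_RUN = 800

# Raw image data per run: 33 Mb x 412 images, rounded to one decimal as on the slide.
raw_gb_per_run = round(RAW_MB_PER_IMAGE * IMAGES_PER_RUN / 1000, 1)

raw_total_gb = RUNS * raw_gb_per_run
processed_total_gb = RUNS * PROCESSED_MB_PER_RUN / 1000
grand_total_gb = raw_total_gb + processed_total_gb

print(raw_gb_per_run, round(raw_total_gb), round(processed_total_gb), round(grand_total_gb))
```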


Repeat masking

• Created own repeat database
  – Almost all are slight variants of existing repeats
  – Only masks ~2% more bases (40% total)
  – Speeds mapping to bovine genome
  – Takes about 4-5 days on ~120 CPUs
  – File size ~10Gbp…
  – Versioning important…


BLAST results

• Map to cattle

• 46% uniquely

• 4.4% unambiguously

• Issues: options, time taken, size of output
  – Weeks of processing time…..


Newbler assembly, e.g. 1Mbp region

Contigs plus singletons
  number          2365
  numberOfBases   1,032,839
  avgSize (bp)    437

Coverage %
  unadj   51.6
  adj     52.7

• This could only be done in one location. Fast (several days). However, alternatives needed to be explored…

• Results needed to be transferred (~3Gbp)


Meld Process

[Diagram labels: Ovine contigs; Align (BLAST); reference contigs; MELD; Bovine reference scaffold]

MELD: CTAGTGCATGCTGCactTGCTataTGTGCtagNNgcATATTGCTGNNTGCTAT

Figure 2. Creation of MELDed ovine sequence using bovine as a guide genome
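The idea in Figure 2 can be sketched as laying ovine contigs onto bovine coordinates and filling uncovered positions with N. The code below is a simplified illustration, not the ISGC MELD implementation: it assumes ungapped placements, lets the first-placed contig win where contigs overlap, and ignores whatever the upper/lower case in the slide's MELD string encodes. The names `meld` and `revcomp` and the `(start, sequence, strand)` tuple layout are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def meld(ref_length, placements, gap_char="N"):
    """Lay contigs onto guide-genome coordinates; uncovered bases stay as N.

    placements: list of (ref_start, contig_seq, strand) with 0-based starts,
    where strand is "+" or "-" relative to the guide scaffold.
    """
    melded = [gap_char] * ref_length
    for start, seq, strand in sorted(placements):
        oriented = seq if strand == "+" else revcomp(seq)
        for offset, base in enumerate(oriented):
            pos = start + offset
            # Keep the first base written at a position (simplifying assumption).
            if 0 <= pos < ref_length and melded[pos] == gap_char:
                melded[pos] = base
    return "".join(melded)


example = meld(12, [(0, "CTAG", "+"), (6, "TGC", "-")])
print(example)
```

A real implementation would also need to handle gapped alignments and reorder blocks where sheep and cattle differ, which is what the "Reorder via ovine BES" step in the strategy slide addresses.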


MELD

• Contigs
  – Ordered
  – Orientated
  – Use BT4+
  – Contigs ~480bp


OA_ver.1.0 coverage

[Bar chart: coverage by bovine chromosome (BTA1 to BTA29, BTAcontig, BTAreptig, BTAchrun, BTAx) for OA_ver.1.0; y-axis 0% to 90%. Series: adj percent non nN ver.1.0; coverage of nonrepetitive btau4 fraction ver.1.0]

Assembly 3.158 Gbp with 1.242 Gbp non N


Copy number variants

• QC process
  – Need sanity checks that assembly is correct
• CNVs
  – Regions >1000bp present variable numbers of times in genome
  – Often duplicated by unequal recombination
  – Can confuse SNP detection and are a very frequent source of assembly errors
• Detection
  – Use “adjusted” depth of ovine 454 reads mapped to BT4 genome
  – At each base pair count depth for each animal and average
  – Done using 50kb window with 1kb increments
• Results
  – Average depth per animal ~0.45X
  – 1-3 CNV regions detected/chromosome
  – Appear to be true CNVs
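The windowed depth scan described under "Detection" might look roughly like the sketch below. It assumes a per-base depth list is already available (for example, averaged across animals), and the flagging rule, a window mean above twice the average of all window means, is a hypothetical threshold chosen for illustration; the slide does not state how depth was "adjusted" or what cut-off was used.

```python
def windowed_depth(depth, window=50_000, step=1_000):
    """Mean read depth in sliding windows; returns a list of (start, mean)."""
    prefix = [0]
    for d in depth:
        prefix.append(prefix[-1] + d)  # prefix sums make each window O(1)
    means = []
    for start in range(0, len(depth) - window + 1, step):
        total = prefix[start + window] - prefix[start]
        means.append((start, total / window))
    return means


def flag_windows(means, ratio=2.0):
    """Return window starts whose mean exceeds ratio x the average window mean.

    The ratio is an illustrative assumption, not a value from the slides.
    """
    if not means:
        return []
    overall = sum(m for _, m in means) / len(means)
    return [start for start, m in means if m > ratio * overall]


# Toy data: flat depth of 1 with a 20 bp stretch at depth 5, scanned with a
# 10 bp window and 5 bp step instead of 50 kb / 1 kb.
toy_depth = [1] * 40 + [5] * 20 + [1] * 40
toy_means = windowed_depth(toy_depth, window=10, step=5)
print(flag_windows(toy_means))
```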


Chromosome 1: as an example


Example putative CNV: BTA1:149Mbp


BTA1:149Mbp Gbrowse view


SNP detection

• SNP Detection Criteria
  – Stacking: collapsed where reads same (animal, plate, bp)
  – Depth: >3 (35% of sequence) and <9 reads deep
  – MAF: at least 2 reads present
  – SNP Class:
    – A: 2 or more animals present for both alleles
    – B: 2 or more animals present for at least 1 allele
    – C: alleles one animal
  – SNP quality: read will be discarded if:
    • variants 10bp either side
    • homopolymeric runs (n>4) within 5bp
    • indels within 10bp
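One possible reading of the stacking, depth, MAF and class rules above is sketched below. The record layout (dicts with `animal`, `plate`, `start`, `allele` keys) and the function names are hypothetical, the read-quality filters are left out, and the class logic (A if both alleles are seen in at least 2 animals, B if only one is, otherwise C) is an interpretation of the terse slide text rather than the consortium's code.

```python
from collections import defaultdict


def collapse_stacks(reads):
    """Keep one read per (animal, plate, start) so stacked duplicates count once."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["animal"], read["plate"], read["start"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


def classify_site(reads, min_depth=4, max_depth=8, min_minor_reads=2):
    """Return "A", "B" or "C" for a biallelic site, or None if it fails a filter."""
    reads = collapse_stacks(reads)
    if not (min_depth <= len(reads) <= max_depth):  # depth >3 and <9
        return None
    counts = defaultdict(int)
    animals = defaultdict(set)
    for read in reads:
        counts[read["allele"]] += 1
        animals[read["allele"]].add(read["animal"])
    if len(counts) != 2 or min(counts.values()) < min_minor_reads:
        return None
    support = sorted(len(a) for a in animals.values())
    if support[0] >= 2:
        return "A"  # both alleles seen in 2 or more animals
    if support[1] >= 2:
        return "B"  # only one allele seen in 2 or more animals
    return "C"      # each allele seen in a single animal


# Plate and start values are invented for illustration.
site = [
    {"animal": "ROM1", "plate": 1, "start": 100, "allele": "t"},
    {"animal": "AWA1", "plate": 2, "start": 105, "allele": "a"},
    {"animal": "SBF1", "plate": 3, "start": 110, "allele": "a"},
    {"animal": "TEX1", "plate": 4, "start": 115, "allele": "t"},
    {"animal": "ROM1", "plate": 1, "start": 120, "allele": "t"},
    {"animal": "TEX1", "plate": 4, "start": 125, "allele": "a"},
]
print(classify_site(site))
```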


How biased is the sampling?

Block depth (only blocks with reads are shown)

[Chart: x-axis block depth (minus one occurrence of the guide sequence), 0 to 20; y-axis count, 0 to 2000. Series: count (raw alignments); count (C minus stacks from same plate); expected Poisson]
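The "expected Poisson" series in a plot like this is the count of blocks expected at each depth if reads land uniformly at random. A minimal sketch, assuming a plain Poisson model with mean depth `mean_depth`; the numbers used in the call are made up and are not taken from the chart, and the function name is hypothetical.

```python
import math


def expected_poisson_counts(n_blocks, mean_depth, max_depth=20):
    """Expected number of blocks at each depth 0..max_depth under Poisson sampling."""
    return [
        n_blocks * math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(max_depth + 1)
    ]


# Hypothetical inputs: 1000 blocks at a mean depth of 2 reads.
expected = expected_poisson_counts(1000, 2.0)
for depth, count in enumerate(expected[:6]):
    print(depth, round(count, 1))
```

Since the chart shows only blocks with reads, a comparison against it would drop the depth-0 term; an observed curve with a heavier tail than this one would suggest stacking or repeat-driven bias.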


Interim SNP Results

MELD            atcgcgtgtagctagtgctagctgctagctagctgatgca
ROM1_read12667  .............t..........................
AWA1_read00345  ........................................
SBF1_read06734  ........................................
TEX1_read00234  .............t..........................
ROM1_read10385  .............t..........................
TEX1_read39890  ........................................

4 unique reads to do a call
A = both alleles seen by 2 animals
B = 1 allele seen in 2 animals
C = both alleles seen one animal

2 = Infinium 2 SNPs: 1 probe, 50bp, no G/C, A/T SNPs
1 = Infinium 1 SNPs: 2 probes, 50bp

~69% pass design (0.8 threshold): ~200K SNPs or ~3 SNPs/50Kb

As expected but rather low; ~5/50kb better

454 SNPs detected in genome (excluding chrUn)

class         SNP_class A   SNP_class B   SNP_class C   Grand Total
1                  33,149        34,451         4,782        72,382
2                 231,119       239,381        32,563       503,063
Grand Total       264,268       273,832        37,345       575,445

454 SNPs detected in genome (including chrUn)

class         SNP_class A   SNP_class B   SNP_class C   Grand Total
1                  34,256        35,592         4,949        74,797
2                 239,092       247,234        33,558       519,884
Grand Total       273,348       282,826        38,507       594,681


Information resources

BLAST & sequence download available

Up to date information on ISGC project aims and progress

www.sheephapmap.org https://isgcdata.agresearch.nz


Genome Annotation

• Visualize sequence and annotation
• Widely used
• Concept of “tracks”
• Each track has significant processing requirements
• Distribute tasks
• Versioning again important
• Significant data transfers
• Can have more than 50 tracks


SNP validation and selection

• Validation
  – Selected 112 Class A 454 SNPs
  – Assay with Sequenom
  – Aim is >85% validation rate in IMF (end of July)
  – Achieved 81%
• Select 60K SNPs for chip
  – Spacing algorithms used based on quality (est MAF, adjacent sequence) and position
  – Multiple runs
  – Target date Aug 22nd
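The slides do not describe the spacing algorithm itself. As a rough illustration of the idea of combining quality with position, the sketch below bins each chromosome into fixed intervals and keeps the best-scoring SNP per bin. The binning approach, the 50kb default, the single `score` value standing in for estimated MAF and adjacent-sequence quality, and the function name are all assumptions.

```python
def select_spaced_snps(snps, spacing=50_000):
    """Pick the highest-scoring SNP in each fixed-width bin of each chromosome.

    snps: iterable of (chrom, pos, score); a higher score means a better candidate.
    """
    best = {}
    for chrom, pos, score in snps:
        key = (chrom, pos // spacing)
        if key not in best or score > best[key][2]:
            best[key] = (chrom, pos, score)
    return sorted(best.values())


# Hypothetical candidates: two compete for the first 50 kb bin on chr1.
candidates = [
    ("chr1", 1_000, 0.5),
    ("chr1", 20_000, 0.9),
    ("chr1", 60_000, 0.7),
    ("chr2", 10, 0.1),
]
chosen = select_spaced_snps(candidates)
print(chosen)
```

Fixed bins do not guarantee a minimum distance between neighbours in adjacent bins, which is one reason a real selection would likely be iterated, in line with the "Multiple runs" bullet.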


Future

• 60K chip: Aug 22nd final date for SNPs
  – Available December 2008
• Assembly
  – 2nd Assembly ~Aug 2008
    • BT4 + sens blast + CAP3 assembler: expect 20% more sequence
  – 3rd Assembly ~Dec 2008
    • + all_vs_all: expect 10% more seq
  – 4th Assembly ~June 2009
    • As above but include ~4.5Gbp more seq inc paired end reads
• Application 10X coverage with ~200bp paired end reads June 2009
• For each assembly
  – annotate with Gbrowse
  – detect SNPs


Lessons learned

• 8th year of international consortia (~4th)
• Data volumes increasing rapidly (1000X)
• Initial data transfer is not the major issue
  – Storage, transfer, annotation is ongoing
  – Processing, synchronisation, sharing resources
• Generate more than 10X volume
• Small numerous 0.1-5Gb transfers
  – Needs reliable transparent high volume data transfer
  – Still issues with firewalls
  – Currently using phone for humans… why (location)?


Acknowledgements

AgResearch NZ: John McEwan, Gemma Payne, Nessa O’Sullivan, Tracey van Stijn, Theresa Wilson, Rudi Brauning, Alan McCulloch, Russell Smithies, Benoit Auvray
Baylor HGSC: Richard Gibbs, George Weinstock, Donna M. Muzny, Michael E. Holder, Lynne Nazareth, Rebecca L. Thorton, Christie Kovar
CSIRO: Brian Dalrymple, James Kijas, Ross Tellam, Wes Barris, Sean McWilliam, Abhirami Ratnakumar, David Townley
Genesis Faraday: Chris Warkup

Roslin Institute sheepGENOMICS UNE/sheepGENOMICS Steve Bishop Terry Longhurst (MLA) Hutton Oddy

Rob Forage

University of Otago University of Sydney USDA Jo Stanton Frank Nicholas Tim Smith Chrissie Curt van Tassell Mark

Funding Genesis Faraday, University of Sydney ISL Grant, and Ovita NZ


International Sheep Genomics Consortia

www.sheephapmap.org

Thanks KAREN and team