The SMRTer Way: Single Genes to Complex Genomes

Ulf Gyllensten, Professor

Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden

Topics

• National Genomics Infrastructure (NGI).

• PacBio from single genes to complex genomes.

National Genomics Infrastructure (NGI)

• Among the five largest European sequencing centers.

• Core facility open to Swedish research groups.

• MPS sequencing, Sanger sequencing and genotyping.

• Funded as a National Research Infrastructure by SciLifeLab, the Swedish Research Council (VR-RFI) and the KAW Foundation.

MPS technologies at NGI

Short-read MPS Long-read MPS

Analysis cluster and storage of MPS data

• ~3 M CPU hours/month on a dedicated cluster.

• ~7 PB storage.

• Long-term storage in archive.

• CPU with extra-large memory (2 TB).

From reads to… assembled genomes

PacBio sequencing at NGI/Uppsala

Two Pacific Biosciences RSII systems

June 2013 August 2014

PacBio – Data production in Uppsala

Assembly projects

• BACs, YACs, fosmids, plasmids

• Gram-positive and Gram-negative bacteria

• Archaea

• Parasitic protists

• Fungi (yeasts, mushrooms)

• Algae

• Mosses

• Higher plants

• Worms

• Butterflies, insects

• Birds

• Lizards

• Fish

• Mammals

Applications on PacBio

Non-clinical applications

• Complete genomes

• BACs/YACs/plasmids

• 16S rRNA

• Gap filling

• Whole transcriptome sequencing

• Isoform discovery

• Amplicon sequencing

• Mutation detection

• Haplotype phasing

• Target re-sequencing

• Metagenomics

• Prokaryotic methylation

Clinical applications

• Chronic Myeloid Leukemia

• Acute Myeloid Leukemia

• HLA sequencing

• Repeat expansions

• Infection screening

PacBio applications

A. Small genome assembly

B. De novo complex genome assembly

C. Targeted sequencing

Small genome assembly

- PacBio is the method of choice for small genomes.

- Sample quality is crucial: good quality gives an (almost) complete genome; poor quality gives a partial genome or none.

Example:

Complex genomes: De Novo Assembly of Rabbit Genome

Two for one genome: assembly of an F1 hybrid between two subspecies of rabbit.

PI: Professor Leif Andersson, Uppsala

Aims:

• Create one or more new reference assemblies.

• In-depth characterization of loci exhibiting strong allele frequency shifts around the hybrid zone between O. c. cuniculus and O. c. algirus in Spain.

• About 2% of the genome shows a dramatic reduction in its ability to spread across the hybrid zone; the rest of the genome leaks across to the other side.

Lagomorpha

The order Lagomorpha consists of two families: the Leporidae (hares and rabbits) and the Ochotonidae (pikas). They likely radiated from a common ancestor in Asia 60 million years ago. The European rabbit (Oryctolagus cuniculus) and its closest extant species, the hispid hare (Caprolagus hispidus) in South Asia, diverged approximately 7-10 million years ago, like most of the Leporidae.

Evolutionary History of Lagomorphs in Response to Global Environmental Change, PLoS One, April 2013 | Volume 8 | Issue 4

Origin and domestication of the European rabbit (O. cuniculus)

[Map labels: O. c. algirus; O. c. cuniculus; dispersal to southern France]

Strategy and challenges

• 300 SMRT cells (around 200 Gb) run in Uppsala

– O. c. cuniculus × O. c. algirus hybrid

• 6 BioNano runs (by BioNano)

• Parents of the F1 hybrid sequenced to 30x using PCR-free Illumina libraries.

• BAC ends and fosmids from the Sanger assembly

– (250k and 2 million, respectively)

• Sanger assembly OryCun2 (2.74 Gb)

• Falcon diploid assembly attempted

• Very high heterozygosity!

De Novo Assembly of Rabbit using BioNano

6 runs conducted with 400 Gb of molecules >150kb


Metric | Initial Assembly | High Depth Assembly | Stringent Assembly
Data input (raw molecules > 150 kb) | 184 Gb | 367 Gb | 367 Gb
Number of genome maps | 3,595 | 3,651 | 5,172
Assembly size | 2.57 Gb | 3.76 Gb | 4.44 Gb
Genome map N50 | 0.87 Mb | 1.4 Mb | 1.07 Mb
Longest genome map | 4.5 Mb | 6.4 Mb | 6.3 Mb

Heterozygous Genome Maps are Produced

[Figure labels: Ref (reference); GM (genome map)]

Population genomics projects

The 1000 Genomes Project: genomes of 2,500 unidentified people from 25 populations.

Genomics England: 100,000 whole genomes from patients by 2017.

SciLifeLab Whole Human Genome Initiative

- WGS of patient cohorts (n = 10,000 individuals/year).

- Establish a Genetic Variant Database for the Swedish Population (n = 1,000).

The Swedish Genetic Variant Project

A. Identify a cohort that reflects the genetic structure of the Swedish population.

B. Generate WGS data using short- and long-read MPS technologies.

C. Establish a user-friendly database to make information available to the research community (association analyses) and clinical genetics laboratories.

The Swedish Twin registry

• Inclusion is based on twinning, and the geographic distribution follows population density.

• Disease prevalence matches that of the general population.

• 10,000 individuals have been analysed with SNP arrays.

• Identify 1,000 individuals based on genetic structure and diversity across Sweden.

Principal components of European samples from the 1000 Genomes Project and 10,000 Swedish samples

[Figure: PCA plot with labeled clusters: Finland, Northern Sweden, Southern-Central Sweden, England and Scotland, Italy, Spain]

The main genetic differentiation is between Southern-Central Sweden and Northern Sweden.

Individuals selected for WGS and 1000 G EUR

[Figure: European samples from the 1000 Genomes Project and the 1,000 selected Swedish samples; labeled clusters: Northern Sweden, Southern-Central Sweden]

WGS of Swedish control cohort

Step 1:

• Short-read Illumina X-Ten data to 30X coverage of the 1,000 individuals.

• Standard pipeline (GATK) for variant calling (SNPs and indels).

• Construct a user-friendly database for the community to make use of the data.

• Status:

– Identification of a control cohort – Q1 2015.

– Short-read MPS – Q1 2016.

– Database implementation – Q1 2016.

Database for genetic variants

CanvasDB (CANdidate Variant Analysis System and Data Base)

• Stores genetic variants with annotations, such as prediction of the functional consequence.

• At present holds the 3.1 billion genetic variants in the 1000 Genomes project.

• Search time is not proportional to database size.

• Filter tools for analysis of monogenic and complex genetic diseases.
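One common way to keep search time from growing in proportion to database size is to index the columns used for lookup. The sketch below illustrates that idea with Python's built-in sqlite3 module. It is not CanvasDB's schema or implementation, which the slides do not describe: the table layout, the function names and the toy variants are all assumptions made for illustration.

```python
import sqlite3

# Toy rows (chrom, pos, ref, alt, consequence); made-up values, not real data.
TOY_VARIANTS = [
    ("1", 1000, "A", "G", "missense"),
    ("1", 2500, "C", "T", "synonymous"),
    ("2", 1200, "G", "A", "intergenic"),
    ("2", 1300, "T", "C", "missense"),
]

def build_variant_db(variants):
    """Create an in-memory variant table with an index on (chrom, pos)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants "
        "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, consequence TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", variants)
    # The B-tree index lets region queries avoid scanning every row.
    conn.execute("CREATE INDEX idx_chrom_pos ON variants (chrom, pos)")
    conn.commit()
    return conn

def variants_in_region(conn, chrom, start, end):
    """Return all variants on chrom with start <= pos <= end, sorted by pos."""
    cursor = conn.execute(
        "SELECT chrom, pos, ref, alt, consequence FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cursor.fetchall()

conn = build_variant_db(TOY_VARIANTS)
print(variants_in_region(conn, "2", 1000, 1250))
```

With the index in place, a region lookup costs roughly the logarithm of the table size plus the number of hits, rather than a full scan.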

The Present Human Reference is Not Complete

• Some regions have been recalcitrant to closure with short-read MPS technology.

• Structural variation makes it difficult to assemble a truly representative genome.

• Long-read whole human genome sequencing provides the missing information.

Genome reference standards

“Platinum” genome sequence

• A contiguous, haplotype resolved representation of the entire genome.

“Gold” genome sequence

• A high-quality, highly contiguous representation.

“Silver” genome sequence

• Standards TBD.

• Non-trio, PacBio/BioNano (PB/BN), no BAC library.

Gold Genome Sequencing Approach

Gold Reference Genomes

Platinum Reference Genomes

The Human Reference Genomes Project

Samples: CHM1, CHM13, NA19240, HG00733, NA12878, HG00514, NA19434

New Reference Human Genome Sequences

• Platinum Genomes

– CHM1: an integrated assembly of Illumina, PacBio, BAC and BioNano data.

– CHM13: PacBio data assembly + BioNano data.

• Gold Genomes

– NA19240: Yoruba trio child; assembly completed.

– HG00733: Puerto Rican trio child; sequencing in progress.

– HG00514: Han Chinese trio child, Q4 2015.

– NA19434: Luhya (Kenya) trio child, Q1 2016.

WGS of Swedish cohort

Step 2:

• Establish Swedish reference genome sequences by de novo assembly of long-read Pacific Biosciences data (+BN).

Ref genome individuals

First Swedish PacBio WGS

First PacBio Assembly

Metric | Value
# of contigs (>= 0 bp) | 7,708
# of contigs (>= 1000 bp) | 7,653
Total length (>= 0 bp) | 2,844 Mb
Total length (>= 1000 bp) | 2,844 Mb
No. of contigs | 7,692
Largest contig | 19.5 Mb
Total contig length | 2,844 Mb
N50 | 4.35 Mb
N75 | 1.97 Mb
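The N50 and N75 values in the assembly statistics above follow the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches 50% (or 75%) of the assembly. A minimal sketch of that calculation follows. The function name `nx` and the toy contig lengths are illustrative assumptions, not the slide's data.

```python
def nx(contig_lengths, x=50):
    """Return the Nx statistic: the length of the contig at which the
    cumulative length, going from longest to shortest, first reaches
    x percent of the total assembly length."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    return min(contig_lengths)  # safeguard; not reached for valid input

# Made-up contig lengths for illustration only.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(nx(toy_contigs, 50))  # 8
print(nx(toy_contigs, 75))  # 3
```

Read this way, an N50 of 4.35 Mb means that half of the 2,844 Mb assembly lies in contigs of at least 4.35 Mb.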

• 20 kb library

• 157 SMRT cells

• 140 Gb data (~45X)

• FALCON assembly
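The coverage figures quoted in this talk can be checked with simple arithmetic: fold coverage is total sequenced bases divided by genome size. The sketch below assumes a human genome size of about 3.1 Gb, which the slides do not state. The 2.74 Gb rabbit figure is the OryCun2 assembly size given earlier. The function name is illustrative.

```python
def fold_coverage(total_bases, genome_size):
    """Average fold coverage = sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size

HUMAN_GENOME_BP = 3.1e9     # assumption: approximate human genome size
RABBIT_ORYCUN2_BP = 2.74e9  # OryCun2 assembly size quoted in the slides

# Swedish PacBio genome: 140 Gb from 157 SMRT cells.
print(round(fold_coverage(140e9, HUMAN_GENOME_BP)))    # 45
print(round(140e9 / 157 / 1e9, 2))                     # 0.89 (Gb per SMRT cell)

# Rabbit F1 hybrid: around 200 Gb from 300 SMRT cells.
print(round(fold_coverage(200e9, RABBIT_ORYCUN2_BP)))  # 73
```

Under these assumptions the human data set comes out at the ~45X quoted on the slide, and the rabbit data set at roughly 73X of the haploid reference size.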

WGS of Swedish control cohort

Step 3:

• Targeted long-read sequencing of regions of high medical importance (HLA, trinucleotide repeat expansions).

• Resolve structural variation and repeats.

• Phase variation in repetitive regions and individual alleles.

• Study the methylation pattern in native DNA.

Methods for Targeted PacBio sequencing

• Long-range PCR.

• Target enrichment by hybridisation using DNA or RNA probe arrays.

• Amplification-free targeted enrichment.

Long-range PCR: HLA sequencing

HLA sequencing workflow

1. LR-PCR Amplification

2. SMRTbell prep

3. SMRT Sequencing

4. PB Long Amplicon Analysis

5. Allele identification (GenDx)

Long-range PCR: FADS

• The FADS region has been under selection in human evolution.

• Regulates the production of Omega-3/6 fatty acids (PUFA).

• The region is associated with many traits and diseases.

• Two main haplotypes in humans: Ancestral and Derived.

FADS project - functional variant at rs174557

Functional variant for FADS1 expression identified!

But is it linked to the Ancestral or Derived haplotype?

Pan et al (submitted)

PacBio sequencing of FADS region

Hybridization capture and pooled sequencing of FADS region:

[Figure labels: AluYe5, rs174559, rs174556, rs174557, > 1.2 kb]

Results:

• Derived haplotype: increases FADS1 activity.

• Ancestral haplotype: reduces FADS1 activity.

Targeted enrichment using DNA probe arrays

Targeted enrichment using RNA probes

Modified version of PacBio+Agilent protocol

Capture of a ~2 kb library

Reads mapped back to human genome

Off-target capture of gene not in probe design region

• MIC-B gene is captured because of high similarity to MIC-A!

[Figure labels: MIC-B, MIC-A]

De novo assembly of captured region

A method to resolve structural variations and repeats

• Repeat length in example: 300-500 bp

• Difficult or impossible to resolve with short reads

Amplification-free targeted enrichment

• Using Cas9 for targeting.

• Sequence native DNA.

• Compatible with multiple targets: HTT, FMR1, ALS & SCA10 in one reaction.

• Under development

Workflow: Input DNA → SMRTbell library → Cas9 targeting → Sequencing
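Once a long read spans a repeat expansion locus, the repeat length can be estimated by counting consecutive copies of the repeat unit. The sketch below shows that counting step under simplifying assumptions: an error-free read and an uninterrupted repeat tract. The function name and the toy read are hypothetical. CAG is used as the example unit because it is the repeat unit at HTT. This is not the analysis method used in the talk.

```python
import re

def longest_repeat_run(read, unit):
    """Return the largest number of consecutive copies of `unit` in `read`."""
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    # Match maximal runs of back-to-back copies of the repeat unit.
    pattern = re.compile("(?:" + re.escape(unit) + ")+")
    best = 0
    for match in pattern.finditer(read):
        copies = (match.end() - match.start()) // len(unit)
        best = max(best, copies)
    return best

# Toy read: flank, 5 x CAG, interruption, 2 x CAG, flank (made-up sequence).
toy_read = "ACGT" + "CAG" * 5 + "TT" + "CAG" * 2 + "GGA"
print(longest_repeat_run(toy_read, "CAG"))  # 5
```

Real reads contain sequencing errors and interrupted repeats, so a production method would need error-tolerant matching rather than the exact matching used here.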

What we sequence at NGI /

Who does the sequencing?

Adam Ameur, Bioinformatician, NGS

Ignas Bunikis, Bioinformatician, NGS

Christian Tellgren-Roth, Bioinformatician, NGS

Ulf Gyllensten, Platform director

Inger Jonasson, Facility manager

Olga Vinnere Pettersson, Project coordinator

Susana Häggqvist, Research engineer, NGS

Cecilia Lindau, Research engineer, NGS

Ulrika Broström, Research engineer, NGS

Ida Höijer, Research engineer, NGS

Maria Schenström, Research engineer, NGS

Nina Williams, Research engineer, NGS

Magdalena Andersson, Research engineer, NGS

Carolina Ilbäck, Research engineer, NGS

Anna Petri, Research engineer, Sequencing Service

Anne-Christine Lindström, Research engineer, Sequencing Service

THANK YOU
