The SMRTer Way: Single Genes to Complex Genomes Ulf Gyllensten, Professor Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden



Page 2

Topics

• National Genomics Infrastructure (NGI).

• PacBio from single genes to complex genomes.

Page 3

National Genomics Infrastructure (NGI)

• Among the five largest European sequencing centers.

• Core facility open to Swedish research groups.

• MPS sequencing, Sanger sequencing and genotyping.

• Funded as a National Research Infrastructure by SciLifeLab, the Swedish Research Council (VR-RFI) and the KAW Foundation.

Page 4

MPS technologies at NGI

Short-read MPS Long-read MPS

Page 5

Analysis cluster and storage of MPS data

• ~3 M CPU hours/month on a dedicated cluster.

• ~7 PB storage.

• Long-term storage in archive.

• CPU with extra-large memory (2 TB).

From reads to… …assembled genomes

Page 6

PacBio sequencing at NGI/Uppsala

Two Pacific Biosciences RSII systems

June 2013 August 2014

Page 7

PacBio – Data production in Uppsala

Page 8

Assembly projects

• BACs, YACs, fosmids, plasmids

• Gram-positive and Gram-negative bacteria

• Archaea

• Parasitic protists

• Fungi (yeasts, mushrooms)

• Algae

• Mosses

• Higher plants

• Worms

• Butterflies, insects

• Birds

• Lizards

• Fish

• Mammals

Page 9

Applications on PacBio

Non-clinical applications

• Complete genomes

• BACs/YACs/plasmids

• 16S rRNA

• Gap filling

• Whole transcriptome sequencing

• Isoform discovery

• Amplicon sequencing

• Mutation detection

• Haplotype phasing

• Target re-sequencing

• Metagenomics

• Prokaryotic methylation

Clinical applications

• Chronic Myeloid Leukemia

• Acute Myeloid Leukemia

• HLA sequencing

• Repeat expansions

• Infection screening

Page 10

PacBio applications

A. Small genome assembly

B. De novo complex genome assembly

C. Targeted sequencing

Page 11

Small genome assembly

- PacBio is the method of choice for small genomes.

- Sample quality is crucial: good quality yields an (almost) complete genome; poor quality yields a partial genome or none.

Example:

Page 12

Complex genomes: De Novo Assembly of Rabbit Genome

Two for one genome: assembly of an F1 hybrid between two subspecies of rabbit.

PI: Professor Leif Andersson, Uppsala

Aims:

• Create new reference assembly/assemblies.

• In-depth characterization of loci exhibiting strong allele frequency shifts around the hybrid zone between O. c. cuniculus and O. c. algirus in Spain.

• About 2% of the genome shows a dramatic reduction in its ability to spread to the other side of the hybrid zone; the rest of the genome leaks across.

Page 13

Lagomorpha

• The order Lagomorpha consists of two families: Leporidae (hares and rabbits) and Ochotonidae (pikas).

• Likely radiated from a common ancestor in Asia 60 million years ago.

• The European rabbit (Oryctolagus cuniculus) and the closest extant species, the hispid hare (Caprolagus hispidus) in South Asia, diverged approximately 7-10 million years ago, like most of the Leporidae.

Evolutionary History of Lagomorphs in Response to Global Environmental Change, PLoS One, April 2013 | Volume 8 | Issue 4

Page 14

Origin and domestication of the European rabbit (O. cuniculus)

O. c. algirus

O. c. cuniculus

Dispersal to southern France

Page 15

Strategy and challenges

• 300 SMRT cells (around 200 Gb) run in Uppsala

– O. c. cuniculus x O. c. algirus hybrid

• 6 BioNano runs (by BioNano)

• Parents of F1 hybrid sequenced to 30x using PCR-free Illumina libraries.

• BAC ends and fosmids from the Sanger assembly

– (250k and 2 million, respectively)

• Sanger assembly OryCun2 (2.74 Gb)

• Falcon diploid assembly attempted

• Very high heterozygosity!
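
The sequencing depth implied by these figures can be estimated with a short calculation. The sketch below assumes the genome size is roughly the OryCun2 assembly size (2.74 Gb) and that reads split evenly between the two haplotypes of the F1 hybrid; neither assumption is stated on the slide, and the function name is illustrative.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Figures taken from the slide. Using the OryCun2 assembly size as the
# genome size is an assumption made for this estimate.
pacbio_bases = 200e9   # around 200 Gb from 300 SMRT cells
orycun2_size = 2.74e9  # Sanger assembly OryCun2

total = fold_coverage(pacbio_bases, orycun2_size)
# Assumption: an F1 hybrid carries two divergent haplotypes, so each one
# receives about half of the total depth.
per_haplotype = total / 2

print(f"about {total:.0f}x total, {per_haplotype:.1f}x per haplotype")
```

Under these assumptions the PacBio data amount to roughly 73x in total. High heterozygosity matters here because a diploid assembler has to separate the two haplotypes, each covered at about half that depth.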

Page 16

De Novo Assembly of Rabbit using BioNano

6 runs conducted with 400 Gb of molecules > 150 kb

Raw data (molecules > 150 kb)   Initial Assembly   High Depth Assembly   Stringent Assembly
Data input                      184 Gb             367 Gb                367 Gb
Number of genome maps           3595               3651                  5172
Assembly size                   2.57 Gb            3.76 Gb               4.44 Gb
Genome map N50                  0.87 Mb            1.4 Mb                1.07 Mb
Longest genome map              4.5 Mb             6.4 Mb                6.3 Mb

Page 17

Heterozygous Genome Maps are Produced

[Figure: reference (Ref) and genome map (GM) tracks]

Page 18

SciLifeLab Whole Human Genome Initiative

- WGS of patient cohorts (n = 10,000 individuals/year).

- Establish a Genetic Variant Database for the Swedish Population (n = 1,000).

Page 19

Population genomics projects

The 1000 Genomes Project: genomes of 2500 unidentified people from 25 populations.

Genomics England: 100,000 whole genomes from patients by 2017.

Page 20

The Swedish Genetic Variant Project

A. Identify a cohort that reflects the genetic structure of the Swedish population.

B. Generate WGS data using short- and long-read MPS technologies.

C. Establish a user-friendly database to make information available to the research community (association analyses) and clinical genetics laboratories.

Page 21

The Swedish Twin registry

• Inclusion based on twinning, with a geographic distribution that follows population density.

• General-population prevalence of any disease.

• 10,000 individuals have been analysed with SNP arrays.

• Identify 1,000 individuals based on genetic structure and diversity across Sweden.

Page 22

Principal components of European samples from the 1000 Genomes Project and 10,000 Swedish samples

[Figure: principal component plot; labeled clusters: Finland, Northern Sweden, Southern-Central Sweden, England and Scotland, Italy, Spain]

Page 23

Individuals selected for WGS and 1000 G EUR

Main genetic differentiation is between Southern-Central Sweden and Northern Sweden.

[Figure: principal component plot of European samples from the 1000 Genomes Project and the 1,000 selected Swedish samples; labeled clusters: Northern Sweden, Southern-Central Sweden]

Page 24

WGS of Swedish control cohort

Step 1:

• Short-read Illumina X-Ten data to 30X coverage of the 1,000 individuals.

• Standard pipeline (GATK) for variant calling (SNPs and indels).

• Construct a user-friendly database so the community can make use of the data.

• Status:

– Identification of a control cohort – Q1 2015.

– Short-read MPS – Q1 2016.

– Database implementation – Q1 2016.

Page 25

Database for genetic variants

CanvasDB (CANdidate Variant Analysis System and Data Base)

• Stores genetic variants with annotations, such as prediction of the functional consequence.

• At present holds the 3.1 billion genetic variants in the 1000 Genomes Project.

• Search time not proportional to database size.

• Filter tools for analyses of monogenic and complex genetic diseases.
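
One common way to keep search time from growing in proportion to table size is an index on the lookup key. The sketch below uses Python's built-in sqlite3 module with a hypothetical table layout and made-up rows; it is not CanvasDB's actual schema or storage engine, which the slides do not describe.

```python
import sqlite3

# Hypothetical schema and rows for illustration; not CanvasDB's actual layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, consequence TEXT)"
)
# A B-tree index on the lookup key lets a query jump to matching rows
# instead of scanning the whole table.
conn.execute("CREATE INDEX idx_variants_locus ON variants (chrom, pos)")
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [
        ("1", 1000, "A", "G", "missense"),
        ("1", 2000, "C", "T", "synonymous"),
        ("2", 1500, "G", "A", "intergenic"),
    ],
)


def lookup(chrom, pos):
    """Return (ref, alt, consequence) tuples stored at one position."""
    query = "SELECT ref, alt, consequence FROM variants WHERE chrom = ? AND pos = ?"
    return conn.execute(query, (chrom, pos)).fetchall()


# EXPLAIN QUERY PLAN reports whether SQLite will use the index for this query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT ref, alt, consequence FROM variants WHERE chrom = ? AND pos = ?",
    ("1", 2000),
).fetchall()

print(lookup("1", 2000))
print(plan[0][-1])
```

With the index in place, a point lookup costs roughly the depth of the B-tree rather than the number of stored rows, which is the property the slide describes.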

Page 26

The Present Human Reference is Not Complete

• Some regions have been recalcitrant to closure with short-read MPS technology.

• Structural variation makes it difficult to assemble a truly representative genome.

• Long-read whole human genome sequencing provides the missing information.

Page 27

Genome reference standards

“Platinum” genome sequence

• A contiguous, haplotype resolved representation of the entire genome.

“Gold” genome sequence

• A high-quality, highly contiguous representation.

“Silver” genome sequence

• Standards TBD.

• Non-trio, PB/BN, no BAC library.

Page 28

Gold Genome Sequencing Approach

Page 29

The Human Reference Genomes Project

[Figure: samples grouped as Platinum Reference Genomes and Gold Reference Genomes; sample labels: CHM1, CHM13, NA19240, HG00733, NA12878, HG00514, NA19434]

Page 30

New Reference Human Genome Sequences

• Platinum Genomes

– CHM1: an integrated assembly of Illumina, PacBio, BAC and BioNano data.

– CHM13: PacBio data assembly + BioNano data.

• Gold Genomes

– NA19240: Yoruba trio child; assembly completed.

– HG00733: Puerto Rican trio child; sequencing in progress.

– HG00514: Han Chinese trio child, Q4 2015.

– NA19434: Luhya (Kenya) trio child, Q1 2016.

Page 31

WGS of Swedish cohort

Step 2:

• Establish Swedish reference genome sequences by de novo assembly of long-read Pacific Biosciences data (+BN).

Ref genome individuals

Page 32

First Swedish PacBio WGS

First PacBio Assembly

# of contigs (>= 0 bp)       7708
# of contigs (>= 1000 bp)    7653
Total length (>= 0 bp)       2844 Mb
Total length (>= 1000 bp)    2844 Mb
No. of contigs               7692
Largest contig               19.5 Mb
Total contig length          2844 Mb
N50                          4.35 Mb
N75                          1.97 Mb

• 20 kb library

• 157 SMRT cells

• 140 Gb data (~45X)

• FALCON assembly
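
The N50 and N75 values in the table follow the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches 50% (or 75%) of the assembly size. A minimal sketch of that calculation, using toy contig lengths rather than the Swedish assembly, with a function name chosen only for illustration:

```python
def n_stat(lengths, fraction=0.5):
    """Length L such that contigs of length >= L cover at least `fraction` of the total."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    target = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length


# Toy contig lengths (hypothetical, not taken from the assembly above).
contigs = [80, 70, 50, 40, 30, 20, 10]

print("N50:", n_stat(contigs, 0.5))
print("N75:", n_stat(contigs, 0.75))
```

For the toy data the total is 300, so N50 is reached at the second contig (80 + 70 = 150) and N75 at the fourth (240 >= 225). N75 is never larger than N50, which matches the 4.35 Mb and 1.97 Mb reported on the slide.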

Page 33

WGS of Swedish control cohort

Step 3:

• Targeted long-read sequencing of regions of high medical importance (HLA, trinucleotide expansion repeats).

• Resolve structural variation and repeats.

• Phase variation in repetitive regions and individual alleles.

• Study the methylation pattern in native DNA.

Page 34

Methods for Targeted PacBio sequencing

• Long-range PCR.

• Target enrichment by hybridisation using DNA or RNA probe arrays.

• Amplification-free targeted enrichment.

Page 35

Long-range PCR: HLA sequencing

Page 36

HLA sequencing workflow

1. LR-PCR Amplification

2. SMRTbell prep

3. SMRT Sequencing

4. PB Long Amplicon Analysis

5. Allele identification (GenDx)

Page 37

Long-range PCR: FADS

• FADS region has been under selection in human evolution.

• Regulates the production of omega-3/6 polyunsaturated fatty acids (PUFA).

• Region is associated with many traits and diseases.

• Two main haplotypes in humans: Ancestral and Derived.

Page 38

FADS project - functional variant at rs174557

Functional variant for FADS1 expression identified!

But is it linked to the Ancestral or Derived haplotype?

Pan et al (submitted)

Page 39

PacBio sequencing of FADS region

Hybridization capture and pooled sequencing of FADS region:

[Figure: FADS region with markers AluYe5, rs174559, rs174556 and rs174557; span > 1.2 kb]

Results:

• Derived haplotype increases FADS1 activity.

• Ancestral haplotype reduces FADS1 activity.
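
Linking a variant to a haplotype from long reads comes down to counting which alleles appear together on the same molecule. The sketch below illustrates the idea with made-up reads; the allele labels ("ref", "alt", "present", "absent") and the read dictionaries are hypothetical placeholders, not data or genotypes from the FADS study.

```python
from collections import Counter


def phase_counts(reads, site_a, site_b):
    """Count allele pairs observed together on single reads covering both sites."""
    pairs = Counter()
    for alleles in reads:
        if site_a in alleles and site_b in alleles:
            pairs[(alleles[site_a], alleles[site_b])] += 1
    return pairs


# Hypothetical reads: each dict maps a site name to the allele seen on that read.
reads = [
    {"rs174557": "ref", "AluYe5": "present"},
    {"rs174557": "ref", "AluYe5": "present"},
    {"rs174557": "alt", "AluYe5": "absent"},
    {"rs174557": "alt"},  # does not span the second site, so it is ignored
]

counts = phase_counts(reads, "rs174557", "AluYe5")
for pair, n in counts.most_common():
    print(pair, n)
```

Only reads that span both sites contribute, which is why read length matters: the markers on the slide sit more than 1.2 kb apart, beyond the reach of a typical short read.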

Page 40

Targeted enrichment using DNA probe arrays

Page 41

Targeted enrichment using RNA probes

Modified version of PacBio+Agilent protocol

Page 42

Capture of a ~2 kb library

Reads mapped back to human genome

Page 43

Off-target capture of gene not in probe design region

• MIC-B gene is captured because of high similarity to MIC-A!

[Figure labels: MIC-B, MIC-A]

Page 44

De novo assembly of captured region

A method to resolve structural variations and repeats

• Repeat length in example: 300-500 bp

• Difficult or impossible to resolve with short reads
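
The limitation can be stated as a simple length check: a read resolves a repeat only if it covers the whole repeat plus some unique sequence on each side to anchor it. In the sketch below, the 20 bp flank, the 150 bp short-read length and the function name are illustrative assumptions, not values from the slides.

```python
def spans_repeat(read_length, repeat_length, min_flank=20):
    """True if one read can cover the whole repeat plus unique flanks on both sides."""
    return read_length >= repeat_length + 2 * min_flank


# Repeat lengths from the slide (300-500 bp). Read lengths are assumptions:
# 150 bp for a typical short read, 2000 bp for a read from a ~2 kb library.
for read_length in (150, 2000):
    results = [spans_repeat(read_length, repeat) for repeat in (300, 500)]
    print(read_length, results)
```

A 150 bp read fails the check for both repeat lengths, while a 2 kb read passes both with room to spare.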

Page 45

Amplification-free targeted enrichment

• Using Cas9 for targeting.

• Sequence native DNA.

• Compatible with multiple targets: HTT, FMR1, ALS & SCA10 in one reaction.

• Under development.

Workflow steps shown: Input DNA, SMRTbell library, Cas9 targeting, Sequencing

Page 47

What we sequence at NGI /

Page 48

Who does the sequencing?

Adam Ameur, Bioinformatician, NGS

Ignas Bunikis, Bioinformatician, NGS

Christian Tellgren-Roth, Bioinformatician, NGS

Ulf Gyllensten, Platform director

Inger Jonasson, Facility manager

Olga Vinnere Pettersson, Project coordinator

Susana Häggqvist, Research engineer, NGS

Cecilia Lindau, Research engineer, NGS

Ulrika Broström, Research engineer, NGS

Ida Höijer, Research engineer, NGS

Maria Schenström, Research engineer, NGS

Nina Williams, Research engineer, NGS

Magdalena Andersson, Research engineer, NGS

Carolina Ilbäck, Research engineer, NGS

Anna Petri, Research engineer, Sequencing Service

Anne-Christine Lindström, Research engineer, Sequencing Service

Page 49

What we sequence at NGI /

THANK YOU