Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

1000 Genomes - A deep catalog of Human Genetic Variation

Integrative analysis in 1000 Genomes data

Fuli Yu

BioInfoSummer, Adelaide Australia

2012

Outline

• Background overview of 1000G

• 1000G Phase I results

• BCM NGS variation analysis software

• Further development and timeline

The history before the 1000 Genomes Project

- Phase I and II: common SNPs in CEU, CHB, JPT, YRI

- HapMap3: 11 populations

- Patterns of linkage disequilibrium and haplotypes defined genome-wide

www.hapmap.org

• Complex diseases gene mapping – GWAS

• Characteristics of the human genome variants: allele frequency spectrum, LD patterns, recombination rate variation…

• Population genetics: selection, migration, drift, admixture

Impacts

1,449 published GWA at p ≤ 5×10^-8 for 237 traits

Disease mutations are likely rare and heterogeneous

McClellan J and King M-C, 2010

‘Clan Genomics’ Lupski JR et al. 2011

The quest for rare genetic variation

Gibbs R 2005

HapMap

1000G

Project goal

“…sequence a large number of people, to provide a comprehensive resource on human genetic variation…”

“…find most genetic variants that have frequencies of at least 1% in the populations studied…”

www.1000genomes.org

1000 Genomes Project Design and Progress

• Pilot data collected in 2008; paper published October 2010 in Nature

– Companions in Science and Genome Research

– Other companions later

• Full project data collection and analysis underway

– Phase 1 results published Nov 1st 2012

– Phase 2 / Phase 3 being completed

• Sequencing completion - early 2013

– Analysis completion in 2013-2014

Nature, Oct 2010

-179 WGS, 700 exon seq

-15M new SNPs

-CNV group

-Exon group

1000G Phase I populations

Mark DePristo

An integrative map of 40 million variants

[Figure: low-pass genomes and deep exomes feed three call types, SNPs (38M), INDELs (1.4M), and SVs (14k), which are merged into integrated genotypes (~40M)]

Hyun Min Kang

Discovery power

• 1% SNPs

– 99.3% genome / 99.8% exome

• 0.1% SNPs

– 70% genome / 90% exome

- Exome: high r2 > 0.9

- With LD information, WGS genotype: improves MAF >= 1% by 30-40%; unchanged for MAF < 0.1%

Phase 1 variants are of high quality

Overall genotype accuracy at ~99%

Hyun Min Kang

Sensitivity >96% in a given genome

Rare variation is population specific

• 17% of low frequency (0.5-5%) in a single ancestry group

• 53% of less than 0.5% in a single population

• African populations have many more low frequency variants due to bottleneck on other lineages

• All populations are enriched in rare variants

– Explosive recent population growth

Slide courtesy of Paul Flicek

Adam Auton, Gil McVean

Rare variants identify recent historical links between populations

48% of IBS variants shared with American populations

ASW shows stronger sharing with YRI than LWK

Adam Auton, Gil McVean

The proportion of rare variants by conservation

Tuuli Lappalainen

Implication for GWAS imputation

Bryan Howie, Hyun Min Kang

BCM NGS PIPELINES: ATLAS2 & SNPTOOLS

Overview of NGS variation analysis pipelines

Nielsen R 2011

SNPTools Atlas2

Atlas uses logistic regression: systematic errors

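$$\log\frac{\Pr(\mathrm{SNP})_i}{1-\Pr(\mathrm{SNP})_i} = \alpha + b_1\,\mathrm{RawQuality} + b_2\,\mathrm{Swap} + b_3\,\mathrm{NQS} + b_4\,\mathrm{Dist}$$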

Values derived from our training experiment:

Item                                    Value    Z score   Significance (p-value)
Intercept α                             -3.3     -39       <2e-16
Coefficient b1 for raw quality score    0.11     19        <2e-16
Coefficient b2 for swap                 -3.5     28        <2e-16
Coefficient b3 for NQS                  0.26     3         0.001
Coefficient b4 for relative position    -0.37    -4        0.0005

[Figure: reads i = 1, 2, ..., n aligned to the reference sequence across sites j = 1 (0/1), 2 (0/0), ..., m (0/0), distinguishing reads harboring reference alleles from reads harboring substitutions]

Shen et al. 2010 Genome Research
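The logistic model on this slide can be evaluated per read with the coefficients listed in the table. The sketch below is only an illustration: the function name `pr_snp_read` and the way the predictors are encoded (swap and NQS as 0/1 flags, distance as a relative position) are assumptions, not details taken from the slide or from Atlas-SNP2 itself.

```python
import math

# Coefficients as listed on the slide (training experiment, Shen et al. 2010).
ALPHA = -3.3          # intercept
B_RAW_QUALITY = 0.11  # b1: raw quality score
B_SWAP = -3.5         # b2: swap
B_NQS = 0.26          # b3: NQS
B_DIST = -0.37        # b4: relative position

def pr_snp_read(raw_quality, swap, nqs, dist):
    """Pr(SNP)_i for one read carrying a substitution.

    Predictor encoding is an assumption made for this sketch:
    swap and nqs are 0/1 flags, dist is a relative position.
    """
    logit = (ALPHA
             + B_RAW_QUALITY * raw_quality
             + B_SWAP * swap
             + B_NQS * nqs
             + B_DIST * dist)
    # Invert the logit: log(p / (1 - p)) = logit  =>  p = 1 / (1 + exp(-logit))
    return 1.0 / (1.0 + math.exp(-logit))

if __name__ == "__main__":
    # A high-quality base passing NQS, with no swap, at relative position 0.
    print(round(pr_snp_read(raw_quality=30, swap=0, nqs=1, dist=0.0), 3))
```

With these coefficients, a swap flag of 1 lowers the logit by 3.5, which is why reads flagged this way contribute little evidence for a SNP.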

Posterior Pr(SNP) using Bayes' rule

Pr(error)i = 1 – Pr(SNP)i

Pr(error)j = ∏ Pr(error)i

Pr(SNP)j = 1- Pr(error)j = Sj

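$$\Pr(\mathrm{SNP} \mid S_j, c) = \frac{\Pr(S_j \mid \mathrm{SNP}, c)\,\mathrm{prior}(\mathrm{SNP} \mid c)}{\Pr(S_j \mid \mathrm{SNP}, c)\,\mathrm{prior}(\mathrm{SNP} \mid c) + \Pr(S_j \mid \mathrm{error}, c)\,\mathrm{prior}(\mathrm{error} \mid c)}$$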

Shen et al. 2010 Genome Research
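The site-level steps on this slide can be sketched in a few lines: combine the per-read Pr(SNP)_i into a site score S_j, then apply Bayes' rule. In this sketch the likelihoods Pr(S_j | SNP, c) and Pr(S_j | error, c) are passed in as plain numbers because the slide does not say how they are obtained; the function names and the choice prior(error | c) = 1 - prior(SNP | c) are assumptions.

```python
import math

def site_score(read_probs):
    """S_j = 1 - prod_i (1 - Pr(SNP)_i) over the reads at site j."""
    pr_error_site = math.prod(1.0 - p for p in read_probs)
    return 1.0 - pr_error_site

def posterior_snp(lik_given_snp, lik_given_error, prior_snp):
    """Pr(SNP | S_j, c) by Bayes' rule.

    lik_given_snp   : Pr(S_j | SNP, c), supplied by the caller
    lik_given_error : Pr(S_j | error, c), supplied by the caller
    prior_snp       : prior(SNP | c); prior(error | c) is taken as
                      1 - prior_snp, an assumption of this sketch
    """
    prior_error = 1.0 - prior_snp
    numerator = lik_given_snp * prior_snp
    denominator = numerator + lik_given_error * prior_error
    if denominator == 0.0:
        raise ValueError("both likelihood terms are zero")
    return numerator / denominator

if __name__ == "__main__":
    s_j = site_score([0.5, 0.5])          # two reads, each with Pr(SNP)_i = 0.5
    print(s_j)                            # site score
    print(posterior_snp(0.9, 0.1, 0.1))   # illustrative numbers only
```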

Exome data summary

• 1128 (822 Illumina/306 SOLiD) samples in 20110521.alignment.index

– 822 Illumina BAMs

• MOSAIK

– 306 SOLiD BAMs

• BFAST

• SNPs are called using Atlas-SNP2 at BCM

Exome SNP calls on consensus target regions

Platform            #Samples   #SNPs     %dbSNP b132   Known Ti/Tv (merged / per-sample)   Novel Ti/Tv (merged / per-sample)
Illumina + SOLiD    1128       457,095   29.23%        3.47 / 3.41                         3.05 / 2.97
SOLiD               306        244,736   42.05%        3.54 / 3.51                         3.19 / 3.03
Illumina            822        348,599   35.94%        3.46 / 3.37                         2.99 / 2.95

Call set               #SNPs     dbSNP    Ti/Tv
Intersection           238,356   48.5%    3.35
Baylor Exome unique    218,739   8.2%     2.97
VQSR v2b unique        23,096    15.3%    2.67
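The Ti/Tv column is the ratio of transitions (A<->G, C<->T) to transversions (all other single-base substitutions). A minimal sketch of the computation, assuming the call set is available as a list of (ref, alt) pairs for biallelic SNPs; the function name `ti_tv_ratio` is hypothetical.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snps):
    """Ti/Tv for a list of (ref, alt) single-base substitutions.

    Assumes every entry is a biallelic SNP with ref != alt.
    """
    ti = sum(1 for ref, alt in snps
             if (ref.upper(), alt.upper()) in TRANSITIONS)
    tv = len(snps) - ti
    if tv == 0:
        raise ValueError("no transversions; Ti/Tv is undefined")
    return ti / tv

if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "C")]
    print(ti_tv_ratio(calls))  # 2 transitions, 1 transversion
```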

SNPTools pipeline overview

Raw Sequence Reads (FASTQ) → Short Reads Mapping → Base Quality Recalibration → Binary sequence Alignment/Map Files (BAM) → SNPTools steps → Haplotype with Confidence Score (VCF) → Downstream Analysis

SNPTools steps:

1. Effective Base Depth

• Novel Effective Base Depth (EBD) summarization for each BAM

• High performance IO, small disk footprint (1~2GB per BAM)

2. SNP Site Discovery

• Novel variance ratio based site discovery statistics

• High sensitivity and specificity

3. Sequence Genotype Likelihood

• Novel BAM-specific binomial mixture modeling (BBMM)

• Captures BAM heterogeneity

4. Existing Genotype Integration

• ‘Dynamic linking’ of multiple existing genotype datasets in a Bayesian style

• Improves both existing genotypes and sequence calls significantly

5. Genotype Imputation

• Novel imputation engine

• High genotyping and phasing accuracy

EBD file format

New algorithm for Genotype Likelihood

• Challenges in Raw Genotype Likelihood

1. Mapping/sequencing errors in site discovery

2. BAM heterogeneity, potential contamination

• Solutions

1. Novel concept of Effective Base Depth (EBD) to summarize sequence details

2. BAM-specific binomial mixture model handles BAM heterogeneity

Rationale

• BAM-specific modeling

– Using whole-genome VQSR sites

– Perform 3-component BBMM on each BAM using Phase I VQSR (38M) SNPs sites

– High precision modeling with 38M data points!

– Enables SNP-array-free QC on individual BAMs

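[Figure: matrix of 1094 BAMs by 39M VQSR SNPs, contrasting site-specific and BAM-specific modeling]

Site-specific modeling: small learning size, BAM heterogeneity, low accuracy for alt/alt

BAM-specific modeling: huge learning size, high accuracy for alt/alt, usable as one QC metric

$$P(r_i) = \sum_{g \in \{rr,\, ra,\, aa\}} w_g \, B(r_i + a_i,\, e_g)$$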
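A 3-component binomial mixture of this kind can be fitted to one BAM with a few EM iterations. The sketch below is not the SNPTools implementation. It assumes that r_i and a_i are reference and alternate read counts at site i, that B(r_i + a_i, e_g) is a binomial over r_i + a_i reads with per-genotype reference-read probability e_g, and that the weights w_g and rates e_g are estimated by EM; all function names are hypothetical.

```python
import math

def binom_logpmf(k, n, p):
    """Log binomial probability of k successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)

def responsibilities(r, a, w, e):
    """Posterior weight of each genotype component (rr, ra, aa) at one site."""
    logs = [math.log(w[g]) + binom_logpmf(r, r + a, e[g]) for g in range(3)]
    top = max(logs)  # subtract the max before exponentiating for stability
    unnorm = [math.exp(x - top) for x in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def fit_bbmm(counts, n_iter=50):
    """Fit mixture weights w and reference-read rates e for one BAM by EM.

    counts: iterable of (ref_count, alt_count) per site.
    Components are ordered rr, ra, aa.
    """
    sites = [(r, a) for r, a in counts if r + a > 0]
    if not sites:
        raise ValueError("no covered sites")
    w = [1.0 / 3.0] * 3
    e = [0.99, 0.5, 0.01]  # starting guesses for P(ref read | genotype)
    for _ in range(n_iter):
        resp_sum, ref_sum, depth_sum = [0.0] * 3, [0.0] * 3, [0.0] * 3
        for r, a in sites:  # E-step
            post = responsibilities(r, a, w, e)
            for g in range(3):
                resp_sum[g] += post[g]
                ref_sum[g] += post[g] * r
                depth_sum[g] += post[g] * (r + a)
        for g in range(3):  # M-step, clamped away from 0 and 1
            w[g] = max(resp_sum[g] / len(sites), 1e-9)
            if depth_sum[g] > 0:
                e[g] = min(max(ref_sum[g] / depth_sum[g], 1e-6), 1.0 - 1e-6)
        norm = sum(w)
        w = [x / norm for x in w]
    return w, e

def genotype_likelihoods(r, a, e):
    """Per-site likelihoods for rr, ra, aa under the fitted BAM-specific rates."""
    return [math.exp(binom_logpmf(r, r + a, e[g])) for g in range(3)]

if __name__ == "__main__":
    toy = [(10, 0)] * 60 + [(5, 5)] * 30 + [(0, 10)] * 10
    weights, rates = fit_bbmm(toy)
    print(weights, rates)
    print(genotype_likelihoods(5, 5, rates))
```

Because the rates e_g are learned per BAM, a contaminated or platform-biased BAM shows up as fitted rates that drift away from the ideal values, which is one way such a fit could serve as a QC metric.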

BBMM overcomes platform heterogeneity

SOLiD GL: BBMM better than Samtools

HM3

OMNI

Hyun Min Kang Univ Mich

Improvement from using BBMM GLs is also seen with Beagle

Hyun Min Kang Univ Mich

SNPTools Imputation – ‘Constraint Li-Stephens’

Phase I Genotypes: Chr1, Chr20 (released 2011-05-08)

call set     OMNI                          HM3                           Axiom
             AA    RA    RR    non-ref     AA    RA    RR    non-ref     AA    RA    RR    non-ref
chr1         1.03  1.02  0.19  1.43        1.64  0.86  0.21  1.43        0.85  1.38  0.19  1.51
chr20        1.02  1.18  0.23  1.60        1.22  0.88  0.25  1.30        1.33  1.48  0.22  1.85
chr20 V4     1.33  1.21  0.37  2.02        1.20  0.83  0.25  1.26        1.36  1.45  0.21  1.83
chr20*       0.99  1.17  0.22  1.57        1.18  0.88  0.25  1.28        1.23  1.47  0.21  1.79
chr20 V4*    1.01  1.11  0.22  1.52        1.18  0.83  0.24  1.25        1.24  1.44  0.21  1.77

• chr1 and chr20 are based on new VQSR sites

• chr20 V4 is based on old VQSR sites

• chr20* and chr20 V4* are the overlapped sites between new VQSR and old VQSR

Chr20 genotype call set

Better OMNI concordance than V4 due to site/allele selection improvement

Similar accuracy on overlapped sites

Chr1 genotype call set

Slightly better than chr20 call set
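Per-class figures like those in the table (AA, RA, RR, non-ref) can be produced by comparing sequence-based calls against array genotypes. The sketch below is one possible way to compute them and rests on assumptions: that the table reports percent discordance, that each class is defined by the array genotype, and that "non-ref" is the discordance over pairs where at least one of the two genotypes is not RR. None of these definitions is stated on the slide.

```python
from collections import Counter

def discordance_by_class(array_gts, call_gts):
    """Percent discordance per array genotype class, plus non-ref discordance.

    Genotypes are the strings 'RR', 'RA', 'AA'; the two lists are paired
    sample-site observations in the same order.
    """
    total, wrong = Counter(), Counter()
    nonref_total = nonref_wrong = 0
    for truth, call in zip(array_gts, call_gts):
        total[truth] += 1
        mismatch = truth != call
        if mismatch:
            wrong[truth] += 1
        # Non-ref discordance ignores pairs where both genotypes are RR.
        if truth != "RR" or call != "RR":
            nonref_total += 1
            nonref_wrong += mismatch
    result = {g: 100.0 * wrong[g] / total[g]
              for g in ("AA", "RA", "RR") if total[g]}
    if nonref_total:
        result["non-ref"] = 100.0 * nonref_wrong / nonref_total
    return result

if __name__ == "__main__":
    array = ["RR", "RR", "RA", "AA"]
    calls = ["RR", "RA", "RA", "AA"]
    print(discordance_by_class(array, calls))
```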

Phasing accuracy evaluation

Integrating known array genotypes

[Figure: sample-by-site matrix combining raw genotype probabilities with known genotypes]

• Direct re-weighting of overall accuracy. Improvement is in proportion to the number of known genotypes integrated.

• Imputation improvement of on-array accuracy. Known genotypes are treated as 99.98% confidence priors, which leaves them open to improvement.

• Imputation improvement of off-array accuracy. Makes full use of the LD between on-array and off-array sites.
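Treating a known array genotype as a 99.98% confidence prior, rather than as fixed truth, can be sketched as a simple Bayesian re-weighting of the raw genotype probabilities. The function name and the even split of the remaining 0.02% across the other genotypes are assumptions for illustration; the slide does not describe how the residual mass is distributed.

```python
def integrate_known_genotype(raw_probs, known_index, confidence=0.9998):
    """Re-weight raw genotype probabilities with a known array genotype.

    raw_probs   : raw probabilities/likelihoods for (RR, RA, AA)
    known_index : index of the array genotype in raw_probs
    confidence  : prior mass on the array genotype (99.98% on the slide);
                  the remainder is split evenly, an assumption of this sketch
    """
    n = len(raw_probs)
    other = (1.0 - confidence) / (n - 1)
    prior = [confidence if g == known_index else other for g in range(n)]
    unnorm = [lik * p for lik, p in zip(raw_probs, prior)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("raw probabilities are all zero")
    return [u / total for u in unnorm]

if __name__ == "__main__":
    # Weak sequence evidence: the array genotype (RR) dominates.
    print(integrate_known_genotype([0.2, 0.5, 0.3], known_index=0))
    # Overwhelming sequence evidence for RA can still override the array call.
    print(integrate_known_genotype([1e-8, 1.0, 1e-8], known_index=0))
```

The second call shows why a prior short of 100% matters: strong sequence evidence can still correct an array genotype.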

Integrating LowPass + ExomeOffTarget

Exome off-target reads are evenly distributed

Exome off-target reads improve sensitivity

• ~5% improved sensitivity in off-target regions

1000G NEW DEVELOPMENT & TIMELINE TO COMPLETION

1000G Phase 2/3 populations

ACB, CDX, GIH, KHV, PEL, CHD, GWD, MSL, ESN, PJL, BEB, STU, ITU

Overview of AFR Phase 2 Call Set Sizes (chr20)

[Figure: bar charts of call set sizes in three panels (SNPs; Indels/Cplxsubs; MNPs) for call sets BCM, BI1, LU, SI1, UM, BC, BI2, OX1, OX2, SI2, grouped into alignment-based and assembly-based call sets]

Bar labels, in extraction order:

SNPs: 195K, 511K, 491K, 480K, 481K, 362K, 460K, 452K, 252K, 0

Indels/Cplxsubs: 17K, 0, 0, 48K, 42K, 42K, 44K, 49K, 46K, 28K

MNPs: 0, 0, 0, 0, 0, 4K, 8K, 19K, 3K, 206

Adrian Tan, Hyun Min Kang

A time-line

• Data generation (incl. LC, exome, CG, SNP arrays) by end March

• Final alignment index from DCC by start June.

• Contributing call sets (SNP, indel, MNP, complex, SV) by end July

• Consensus and resolved site list with GLs by end August

• Integrated haplotypes by ASHG 2013

Gil McVean

Acknowledgements

BCM-HGSC

• Yi Wang: SNPTOOLS

• Jin Yu: Atlas-SNP

• Danny Challis: Atlas-INDEL

• Uday Evani: VCFPRINTER

• Matthew Bainbridge

• Donna Muzny

• Jeffrey Reid

• Richard Gibbs

Boston College

• Gabor Marth

• Amit Indap

• Wen Fung Leong

• Alistair Ward

Broad Institute

• Mark DePristo

• Ryan Poplin

• Eric Banks

Stanford University

• Simon Gravel

• Carlos Bustamante

Univ of Michigan

• Goncalo Abecasis

• Hyun Min Kang

BCM-BRL

• Andrew R. Jackson

• Sameer Paithankar

• Cristian Coarfa

• Aleksandar Milosavljevic

BlueBioU@Rice University

• Kim Andrews

• Roger Moye

• Chandler Wilkerson

Postdoc positions available

Contact

Fuli Yu

fyu@bcm.edu
