The Human Reference Assembly. Deanna M. Church, Staff Scientist, NCBI. @deannachurch. Short Course in Medical Genetics. Updating the assembly.

The Human Reference Assembly


Page 1: The Human Reference Assembly

The Human Reference Assembly

Deanna M. Church, Staff Scientist, NCBI

@deannachurch

Short Course in Medical Genetics 2013

Updating the assembly

Page 2: The Human Reference Assembly

Oh No! Not a new version of the human genome!

Updating the assembly

Page 3: The Human Reference Assembly

Updating the assembly

Page 4: The Human Reference Assembly

120 Fix PATCHES: Chromosome update in GRCh38

71 Novel PATCHES: Additional sequence added

(adds >5 Mb of novel sequence to the assembly)

(adds >800K of novel sequence to the assembly)

Releasing patches quarterly

GRCh37.p13 (160 regions: >3% of chromosomes)

Summer of 2013

Page 5: The Human Reference Assembly

Assembly (e.g. GRCh37.p5)

Primary Assembly

Non-nuclear assembly unit (e.g. MT)

ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7, ALT 8, ALT 9

PAR

Patches

Genomic regions: MHC, UGT2B17, MAPT, ABO, SMA, PECAM1

Data Model
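
The data model on this slide can be sketched as a small set of classes. This is a minimal illustration, not NCBI's or the GRC's actual schema: the class names, the `kind` labels, and the `units_of_kind` helper are assumptions made for this example, and the slide does not say which ALT unit covers which genomic region, so regions are listed by name only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssemblyUnit:
    name: str
    kind: str  # hypothetical labels: "primary", "alt", "patches", "non-nuclear"


@dataclass
class Assembly:
    name: str
    units: List[AssemblyUnit] = field(default_factory=list)
    regions: List[str] = field(default_factory=list)  # region names only

    def units_of_kind(self, kind: str) -> List[str]:
        """Return the names of all assembly units of the given kind."""
        return [u.name for u in self.units if u.kind == kind]


# Unit and region names are taken from the slide; the structure is a sketch.
grch37_p5 = Assembly(
    name="GRCh37.p5",
    units=(
        [AssemblyUnit("Primary Assembly", "primary")]
        + [AssemblyUnit(f"ALT {i}", "alt") for i in range(1, 10)]
        + [
            AssemblyUnit("Patches", "patches"),
            AssemblyUnit("Non-nuclear (MT)", "non-nuclear"),
        ]
    ),
    regions=["MHC", "UGT2B17", "MAPT", "ABO", "SMA", "PECAM1"],
)

print(grch37_p5.units_of_kind("alt"))
```

The point of the sketch is that an assembly is a collection of units rather than a single flat set of chromosomes, so a tool has to decide which units it loads.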

Page 6: The Human Reference Assembly

Assembly: GCA_000001405.6 / GCF_000001405.17

Primary Assembly: GCA_000001305.1 / GCF_000001305.13

ALT 1: GCA_000001315.1 / GCF_000001315.1

ALT 2: GCA_000001325.1 / GCF_000001325.2

ALT 3: GCA_000001335.1 / GCF_000001335.1

ALT 4: GCA_000001345.1 / GCF_000001345.1

ALT 5: GCA_000001355.1 / GCF_000001355.1

ALT 6: GCA_000001365.1 / GCF_000001365.2

ALT 7: GCA_000001375.1 / GCF_000001375.1

ALT 8: GCA_000001385.1 / GCF_000001385.1

ALT 9: GCA_000001395.1 / GCF_000001395.1

Patches: GCA_000005045.5 / GCF_000005045.4

Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1

Data Model
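
Each unit on this slide carries a GenBank (GCA) and a RefSeq (GCF) accession that share the numeric part but are versioned independently. A small parser makes that explicit. This is a sketch: the regular expression assumes the nine-digit form seen on the slide, and the function names are hypothetical.

```python
import re

# Assumed shape: prefix, underscore, nine digits, dot, version number.
ACCESSION_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def parse_accession(acc):
    """Split an assembly accession into prefix, number and integer version."""
    m = ACCESSION_RE.match(acc.strip())
    if m is None:
        raise ValueError(f"not an assembly accession: {acc!r}")
    prefix, number, version = m.groups()
    return {"prefix": prefix, "number": number, "version": int(version)}


def is_matched_pair(gca, gcf):
    """True if the two accessions are a GCA/GCF pair with the same number."""
    a, b = parse_accession(gca), parse_accession(gcf)
    return a["prefix"] == "GCA" and b["prefix"] == "GCF" and a["number"] == b["number"]


print(parse_accession("GCF_000001405.17"))
print(is_matched_pair("GCA_000001405.6", "GCF_000001405.17"))
```

Because the versions differ between the two sides of a pair, recording the full accession.version string is safer than recording only the assembly name.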

Page 7: The Human Reference Assembly

GRCh38 is coming (September, 2013)

Page 8: The Human Reference Assembly
Page 9: The Human Reference Assembly

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

Page 10: The Human Reference Assembly
Page 11: The Human Reference Assembly
Page 12: The Human Reference Assembly

http://genomereference.org

Page 13: The Human Reference Assembly

Why does missing sequence matter?

GRCh37: Duplicon A

Sample genome: Duplicon A, Duplicon B

Increased coverage may or may not be detected, depending on sequencing depth and library quality (easier to find with new technologies than with old, low-throughput technologies).

G>A (allelic difference: true variant)

G>C (paralogous sequence variant: false positive)
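
A toy example can make the false positive concrete. The sequences, positions, and the full-length "reads" below are invented for illustration, and the mismatch caller is a deliberately naive stand-in for a real aligner and variant caller. It assumes that reads from a duplicon missing from the reference are forced onto the copy that is present.

```python
def substitute(seq, pos, base):
    """Return seq with the base at 0-based position pos replaced."""
    return seq[:pos] + base + seq[pos + 1:]


def call_mismatches(reference, aligned_reads):
    """Naive caller: report every non-reference base seen at each position."""
    calls = []
    for pos, ref_base in enumerate(reference):
        observed = {read[pos] for read in aligned_reads}
        for alt in sorted(observed - {ref_base}):
            calls.append((pos, ref_base, alt))
    return calls


# Invented sequence standing in for Duplicon A in the reference.
ref_a = "ACTTGACCATGCTAACGTTAGCCATTCA"
# Duplicon B differs from A by one paralogous difference (G>C at position 20).
ref_b = substitute(ref_a, 20, "C")

# The sample carries a true allelic variant in A (G>A at position 4).
sample_a = substitute(ref_a, 4, "A")
sample_b = ref_b

# Reference with only Duplicon A: reads from B pile up on A.
print(call_mismatches(ref_a, [sample_a, sample_b]))

# Reference with both duplicons: each read aligns to its own copy.
print(call_mismatches(ref_a, [sample_a]), call_mismatches(ref_b, [sample_b]))
```

With only Duplicon A in the reference, the G>C paralogous difference is reported alongside the true G>A variant. With both copies represented, only the true variant remains.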

Page 14: The Human Reference Assembly

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CDC27

1KG Phase 1 Strict accessibility mask

SNP (all)

SNP (not 1KG)

Page 15: The Human Reference Assembly

Sudmant et al., 2010

Page 16: The Human Reference Assembly

Kidd et al, 2007 APOBEC cluster

Part of chr22 assembly

Alternate locus for chr22

White: Insertion

Black: Deletion

Page 17: The Human Reference Assembly

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

Page 18: The Human Reference Assembly

Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320

129S6/SvEvTac tiling path

Alignment to C57BL/6J chr1

B6 Genes

129S6/SvEvTac Genes

+32 kb in 129S6/SvEvTac
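
The Ren1 coordinates on this slide can be parsed and sized with a few lines. This is a sketch: it assumes the coordinates are 1-based and inclusive, which the slide does not state, and the function names are hypothetical.

```python
import re

REGION_RE = re.compile(r"^(?P<name>\S+):\s*(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text):
    """Parse 'SEQ: START-END' into (name, start, end)."""
    m = REGION_RE.match(text.strip())
    if m is None:
        raise ValueError(f"cannot parse region: {text!r}")
    start, end = int(m.group("start")), int(m.group("end"))
    if start > end:
        raise ValueError("start must not exceed end")
    return m.group("name"), start, end


def region_length(start, end):
    """Length under the assumed 1-based, inclusive convention."""
    return end - start + 1


name, start, end = parse_region("NC_000067.6: 133350674-133360320")
print(name, region_length(start, end))
```

Under that assumption the displayed C57BL/6J interval is 9,647 bp long.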

Page 19: The Human Reference Assembly

Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320

NM_031192.3: transcript from C57BL/6J

NM_031193.2: transcript from FVB/N

129S6/SvEvTac Alt Locus Alignment (allelic)

FVB/N Transcript Alignment (paralog)

Page 20: The Human Reference Assembly

129S6/SvEvTac Ren1

FVB Ren2 Tx

Paralogous diff

SNP + Paralogous diff

Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320

NM_031192.3: transcript from C57BL/6J

NM_031193.2: transcript from FVB/N

Page 21: The Human Reference Assembly

Hydin: chr16 (16q22.2)

Hydin2: chr1 (1q21.1)

Missing in NCBI35/NCBI36

Unlocalized in GRCh37

Finished in GRCh38

Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID

Alignment to Hydin1 CHM1_1.0, >99.9% ID


Doggett et al., 2006

Page 22: The Human Reference Assembly

Dennis et al., 2012

1q32 1q21 1p21

1p21 patch alignment to chromosome 1

Page 23: The Human Reference Assembly

Preview of GRCh38 (scheduled Fall 2013)

TEX28 TKTL1

LOC101060233(opsin related)

LOC101060234(TEX28 related)

GRCh37 (current reference assembly) chrX

Page 24: The Human Reference Assembly

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-21

NCBI36 (hg18)

GRCh37 (hg19)

Page 25: The Human Reference Assembly

NCBI35 (hg17)

GRCh37 (hg19)

AL139246.20

AL139246.21

Page 26: The Human Reference Assembly

Fixing Rare/Incorrect Bases

Page 27: The Human Reference Assembly

Fixing Rare/Incorrect Bases

Page 28: The Human Reference Assembly

GRCh37B Sites for Update: n=1164

Sites with unique successful contig: 1148 (98.6%)

Avg Length: 448 bp

Min/Max Success Length: 51/791 bp

Avg Coverage: 80x

Read Source (all contigs)

High coverage: 32%

Low coverage: 57%

Exome: 10%
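
The success rate quoted on this slide follows directly from the two counts. The short check below uses only the numbers shown above; the variable names are arbitrary.

```python
sites_total = 1164    # sites selected for update
sites_success = 1148  # sites with a unique successful contig

success_pct = 100 * sites_success / sites_total
unresolved = sites_total - sites_success

print(f"{success_pct:.1f}% successful, {unresolved} sites unresolved")

# Read-source shares as listed on the slide (rounded percentages).
read_source_pct = {"High coverage": 32, "Low coverage": 57, "Exome": 10}
print(sum(read_source_pct.values()))
```

The read-source shares sum to 99 rather than 100, which is consistent with each figure having been rounded.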

Fixing Rare/Incorrect Bases

Page 29: The Human Reference Assembly

A = 0.000, G = 1.000

rs4732519
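
One way to find candidate bases for correction is to flag sites where the reference allele is essentially absent from population data. The sketch below is an illustration only, not the GRC's actual procedure. The slide gives the frequencies but not which allele is in the reference, so treating A as the reference base is an assumption, and the second site is invented for contrast.

```python
def flag_rare_reference_alleles(sites, max_ref_freq=0.0):
    """Return IDs of sites whose reference allele frequency is at or below the cutoff."""
    flagged = []
    for site in sites:
        ref_freq = site["freqs"].get(site["ref"], 0.0)
        if ref_freq <= max_ref_freq:
            flagged.append(site["id"])
    return flagged


sites = [
    # Frequencies from the slide; the choice of "A" as reference is assumed.
    {"id": "rs4732519", "ref": "A", "freqs": {"A": 0.000, "G": 1.000}},
    # Invented site where the reference allele is common.
    {"id": "toy_site", "ref": "C", "freqs": {"C": 0.62, "T": 0.38}},
]

print(flag_rare_reference_alleles(sites))
```

A flagged site still needs review, since a zero frequency can also reflect a region that is hard to genotype rather than an error in the reference.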

Page 30: The Human Reference Assembly

rs4732519

RP11 WGS reads

Private RP11 variant?

Missing in 1000G?

Page 31: The Human Reference Assembly

FAM23_MRC1 Region, chr10

Segmental Duplications

1KG accessibility Mask

Novel Patch 250 kb of artificial duplication

Page 32: The Human Reference Assembly

Genovese et al., 2013

Page 33: The Human Reference Assembly

Adding Novel Sequence

Page 34: The Human Reference Assembly

Adding Novel Sequence

Karen Hayden and Jim Kent

Page 35: The Human Reference Assembly

Human Resolved for GRCh38

http://genomereference.org

Page 36: The Human Reference Assembly

Richa Agarwala

MHC Alternate locus

Alignment to chr6

Page 37: The Human Reference Assembly
Page 38: The Human Reference Assembly

Making the assembly accessible to existing tools: masking

Query set: 439,109,084 NA12878 HiSeq reads

Page 39: The Human Reference Assembly

Masking effectively blocks alignments in regions with high identity

Simulated reads from GRCh37.p9
  • Unpaired reads
  • 101 bp
  • 1x coverage
  • Default wgsim parameters

Masking parameters
  • Percent Id: 100%
  • Step size: 5 bp
  • Minimum length: 101 bp
  • Center SNPs in unmasked regions
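
The idea behind the masking parameters can be sketched on a toy scale. This is not the procedure used for the slide: it assumes the alt/patch sequence is already in a gapless, same-length alignment with the primary sequence, it masks the alt side, and it models only the 100% identity and minimum-length settings. The step size and SNP centering are not represented, and the function name is hypothetical.

```python
def mask_identical_runs(primary, alt, min_length=101):
    """Replace runs of the alt sequence identical to the primary with N.

    Only runs of at least min_length consecutive identical bases are masked.
    """
    if len(primary) != len(alt):
        raise ValueError("sketch assumes a gapless, equal-length alignment")
    masked = list(alt)
    run_start = None
    # Iterate one past the end so a run that reaches the end is closed.
    for i in range(len(alt) + 1):
        same = i < len(alt) and primary[i] == alt[i]
        if same and run_start is None:
            run_start = i
        elif not same and run_start is not None:
            if i - run_start >= min_length:
                masked[run_start:i] = "N" * (i - run_start)
            run_start = None
    return "".join(masked)


# Toy sequences with one difference at position 4; min_length lowered to fit.
print(mask_identical_runs("ACGTACGTAC", "ACGTTCGTAC", min_length=5))
```

In the toy call the five identical bases after the difference are masked, while the four identical bases before it are too short to qualify and stay available for alignment.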

Page 40: The Human Reference Assembly

Masking improves alignments in regions with alternate loci or patches

Page 41: The Human Reference Assembly

NA12878 reads whose best alignment was on an alt/patch in the masked assembly were evaluated for their alignment location when aligned to the primary assembly alone

Masking reduces the increase in NA12878 reads with MAPQ=0 alignments that occurs when the full assembly is used as the alignment target
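
The MAPQ=0 comparison can be reproduced on any pair of alignments by counting mapped records with a mapping quality of zero. The sketch below reads SAM-formatted text lines directly, relying on FLAG being the second column and MAPQ the fifth. The records are invented toy data, not NA12878 alignments, and the function name is hypothetical.

```python
def mapq0_fraction(sam_lines):
    """Fraction of mapped SAM records whose MAPQ is 0."""
    total = 0
    zero = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        total += 1
        if int(fields[4]) == 0:
            zero += 1
    return zero / total if total else 0.0


def toy_record(name, flag, rname, pos, mapq):
    """Build a minimal 11-column SAM record (invented data)."""
    cols = [name, flag, rname, pos, mapq, "101M", "*", 0, 0, "*", "*"]
    return "\t".join(str(c) for c in cols)


toy_sam = [
    "@HD\tVN:1.4",
    toy_record("r1", 0, "chr6", 100, 60),
    toy_record("r2", 0, "chr6", 200, 0),
    toy_record("r3", 4, "*", 0, 0),  # unmapped, ignored
    toy_record("r4", 16, "chr6", 300, 0),
]

print(mapq0_fraction(toy_sam))
```

Running the same count on alignments to the primary assembly, the full assembly, and the masked full assembly gives the three numbers needed for the comparison described above.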

Page 42: The Human Reference Assembly

Take home messages

The assembly you use for analysis is an important part of your analysis package.

The reference assembly is not a set of linear sequences but can now represent allelic diversity.

Tools still need to catch up.

The human reference assembly is updating soon! (Remember: assemblies are not static if you are lucky!)